Buffered Count-Min Sketch on SSD: Theory and Experiments

04/27/2018 ∙ by Mayank Goswami, et al.

Frequency estimation data structures such as the count-min sketch (CMS) have found numerous applications in databases, networking, computational biology and other domains. Many applications that use the count-min sketch process massive and rapidly evolving datasets. For data-intensive applications that aim to keep the overestimate error low, the count-min sketch may become too large to store in available RAM and may have to migrate to external storage (e.g., SSD). Due to the random-read/write nature of the count-min sketch's hash operations, simply placing it on SSD stifles the performance of time-critical applications, requiring about 4-6 random reads/writes to SSD per estimate (lookup) and update (insert) operation. In this paper, we expand on the preliminary idea of the Buffered Count-Min Sketch (BCMS) [15], an SSD variant of the count-min sketch that uses hash localization to scale efficiently out of RAM while keeping the total error bounded. We describe the design and implementation of the buffered count-min sketch, and empirically show that our implementation achieves a 3.7x-4.7x speedup on update (insert) and a 4.3x speedup on estimate (lookup) operations. Our design also offers an asymptotic improvement in the external-memory model [1] over the original data structure: r random I/Os are reduced to 1 I/O for the estimate operation. For a data structure that uses k blocks on SSD, with w as the word/counter size, r as the number of rows, and M as the number of bits of main memory, our data structure uses kwr/M amortized I/Os per update, or, if kwr/M > 1, 1 I/O in the worst case. In typical scenarios, kwr/M is much smaller than 1. This is in contrast to O(r) I/Os incurred for each update in the original data structure.


1 Introduction

Applications that generate and process massive data streams are becoming pervasive [3, 20, 22, 16, 28] across many domains of computer science. Common examples of streaming datasets include financial markets, telecommunications, IP traffic, sensor networks, and textual data [3, 11, 29, 8]. Processing fast-evolving and massive datasets poses a challenge to traditional database systems, where commonly the application stores all data and subsequently queries it. In the streaming model [4], the dataset is too large to be completely stored in the available memory, so every data item is seen and processed once — an algorithm in this model performs only one scan of the data and uses sublinear local space.

The streaming scenario imposes some limitations on the types of problems we can solve under such strict time and space constraints. A classic example is the heavy hitter problem HH(k) on a stream of pairs (a_t, c_t), where a_t is the item identifier and c_t is the count of the item at timeslot t, with the goal of reporting all items whose frequency is at least N/k, where N is the total count. The general version of the problem, with the exception of when k is a small constant (for k = 2 the problem goes by the name of the majority element), cannot be solved exactly in the streaming model [25, 29]. However, the approximate version of the problem, ε-HH(k), where all items of frequency at least N/k are reported, and an item with a larger error than εN might be reported with a small probability δ, is efficiently solved with the count-min sketch [12, 21] data structure. The count-min sketch accomplishes this in O((1/ε) ln(1/δ)) space, usually far below linear space in most applications.

The count-min sketch [12, 21] has been extensively used to answer heavy hitters, top-k queries, and other popularity-measure queries that represent the central problem in the streaming context, where we are interested in extracting the essence from an impractically large amount of data. Common applications include displaying the list of bestselling items, the most clicked-on websites, the hottest search-engine queries, the most frequently occurring words in a large text, and so on [27, 22, 30].

The count-min sketch (CMS) is a hashing-based, probabilistic, and lossy representation of a multiset, used to answer the count of a query q (the number of times q appears in a stream). It has two error parameters: 1) ε, which controls the overestimation error, and 2) δ, which controls the failure probability of the algorithm. The CMS provides the guarantee that the estimation error for any query exceeds εN with probability at most δ, where N is the size of the stream. If we set c = ⌈e/ε⌉ and r = ⌈ln(1/δ)⌉, the CMS is implemented using r hash functions as a 2D array of dimensions c x r.

When ε and δ are constants, the total overestimate grows proportionately with N, the size of the count-min sketch remains small, and the data structure easily fits in the smaller and faster levels of memory. For some applications, however, the allowed estimation error of εN is too high when ε is fixed. Consider an example where the overestimate εN is 16 and the total data structure size is 3.36GB, provided each counter uses 4 bytes. If we double the dataset size N, then the total overestimate also doubles to 32 if ε stays the same. On the other hand, if we want to maintain the fixed overestimate of 16, then ε must be halved and the data structure size doubles to 6.72GB.

In this paper, we expand on the preliminary idea of the Buffered Count-Min Sketch (BCMS) [15], an SSD variant of the count-min sketch data structure that scales efficiently to large datasets while keeping the total error bounded. Our work expands on the previous work by introducing a detailed design, implementation, and experiments, as well as a mathematical analysis of the new data structure; our original paper [15], which to the best of our knowledge is the only attempt thus far to scale the count-min sketch to SSD, contains only an outline of the data structure.

To demonstrate the issues arising from a growing count-min sketch stored in lower levels of the memory hierarchy, we run a mini in-RAM experiment for count-min sketch sizes 4KB-64MB. In Figure 1, we see that to maintain the same error, the cost of an update increases as the data structure moves to lower levels of memory, even though we keep the number of hash functions fixed for all data structure sizes. A noticeable jump in the cost is visible at the boundary of the L2 and L3 caches (at 3MB).

Figure 1: The effect of increasing count-min sketch size on the update operation cost in RAM.

Asymptotically, storing the unmodified count-min sketch on SSD or a disk is inefficient, given that each estimate and update operation needs r hashes, which results in r random reads/writes to SSD, far below the desired throughput for most time-critical streaming applications.

Another context where we see the CMS becoming large even when ε is fixed is in some text applications, where the number of elements inserted in the sketch is quadratic in the original text size. For instance, [19] uses a CMS to record distributional similarity on the web, where each pair of words is inserted as a single item into the CMS, and 90GB of text requires a CMS of 8GB.

1.1 Results

  1. We describe the design and implementation of the buffered count-min sketch, and empirically show that our implementation achieves a 3.7x-4.7x speedup on update (insert) and a 4.3x speedup on estimate (lookup) operations.

  2. Our design also offers an asymptotic improvement in the external-memory model [1] over the original data structure: r random I/Os are reduced to 1 I/O for estimate. For a data structure that uses k blocks on SSD, with w as the word/counter size, r as the number of rows, and M as the number of bits of main memory, our data structure uses kwr/M amortized I/Os per update, or, if kwr/M > 1, 1 I/O in the worst case. In typical scenarios, kwr/M is much smaller than 1. This is in contrast to O(r) I/Os incurred for each update in the original data structure.

  3. We mathematically show that for the buffered count-min sketch, the error rate does not substantially degrade over the original count-min sketch. Specifically, we prove that for any query q, our data structure provides the following guarantee:

       Pr[Error(q) ≥ εN(1 + o(1))] ≤ δ + o(δ).

We focus on scenarios where the allowed estimation error εN is sublinear in N. For example, what if we want the estimation error to be no larger than √N or N^(2/3)? These scenarios correspond to ε = N^(-1/2) or ε = N^(-1/3), and now for even moderately large values of N, the count-min sketch becomes too large to fit in main memory. Even under more modest conditions on the error, and with a reasonably large main memory, the count-min sketch is unlikely to fit in memory. We will assume that ε is at least 1/N but small enough that the sketch exceeds the available RAM: higher values of ε do not require the count-min sketch to be placed on disk, and lower values of ε mean exact counts are desired.

2 Related Work

The streaming model represents many real-life situations where the data is produced rapidly and on a constant basis. Some of the applications include sensor networks [22], monitoring web traffic [26], analyzing text [19], and monitoring satellites orbiting the Earth [18].

Heavy hitters, top-k queries, iceberg queries, and quantiles [28, 22, 3] are some of the most central problems in the streaming context, where we wish to extract the general trends from a massive dataset. The count-min sketch has proved useful in such contexts for its space-efficiency and accurate counts [12, 20].

The count-min sketch can be well illustrated through its connection to the Bloom filter [6, 9, 7]. Both data structures are lossy and space-efficient representations of sets, used to reduce disk accesses in time-critical applications. A Bloom filter answers membership queries and can have false positives, while the count-min sketch answers frequency queries and can overestimate the actual frequency count. Both data structures are hashing-based and suffer from similar issues when placed directly on SSD or a magnetic disk.

There have been earlier attempts to scale Bloom filters to SSD using buffering and hash localization [10, 13]. Our paper employs methods similar to those in [10, 13]. The improvement, both in our case and in the case of the buffered Bloom filter [10], is achieved at the expense of an extra hash function that determines to which page each element hashes.

There has also been work on designing cache-efficient equivalents of the Bloom filter, such as the quotient filter, the counting quotient filter (CQF), and the write-optimized, on-disk cascade filter [5, 14, 23]. An important distinction between these data structures and the count-min sketch is that the CQF gives exact counts for most elements, given that the errors caused by false positives are usually very small. However, since the errors are independent, the CQF does not offer any guarantee on the overestimate. For example, two frequently occurring elements in a multiset can collide with each other and both will have large overcounts. On the other hand, the CMS does not give exact counts of elements due to multiple hashes and its size (the width of the CMS is smaller than the number of slots in a CQF), but it can guarantee that the overestimate is smaller than εN with probability at least 1 - δ.

2.1 Count-Min Sketch: Preliminaries

In the streaming model, we are given a stream S of pairs (a_t, c_t), where a_t denotes the item identifier (e.g., IP address, stock ID, product ID), and c_t denotes the count of the item (e.g., the number of bytes sent from the IP address, the amount by which a stock has risen/fallen, or the number of sold items). Each pair is an item within a stream of length N, and the goal is to record the total sum of frequencies for each particular item a.

For a given estimation error rate ε and failure probability δ, define c = ⌈e/ε⌉ and r = ⌈ln(1/δ)⌉. The count-min sketch is represented via a 2D table with c buckets (columns) and r rows, implemented using r hash functions (one hash function per row).

The CMS has two operations: UPDATE(a, c_t) and ESTIMATE(a), the respective equivalents of insert and lookup, which are performed as follows:

  • UPDATE(a, c_t) inserts the pair by computing the r hash functions on a and incrementing the appropriate slots determined by the hashes by the quantity c_t. That is, for each hash function h_i, 1 ≤ i ≤ r, we set CMS[i][h_i(a)] = CMS[i][h_i(a)] + c_t. Note that in this paper we use c_t = 1, so every time an item is updated, it is just incremented by 1.

  • ESTIMATE(a) reports the frequency of a, which can be an overestimate of the true frequency. It does so by calculating the r hashes and taking the minimum of the values found in the appropriate cells. In other words, we return min over 1 ≤ i ≤ r of CMS[i][h_i(a)]. Because different elements can hash to the same cells, the count-min sketch can return an overestimated (never underestimated) value of the count, but for this to happen, a collision needs to occur in each row. The estimation error is bounded: the data structure guarantees that for any particular item, the error is within εN with probability at least 1 - δ, i.e., Pr[ESTIMATE(a) ≤ f_a + εN] ≥ 1 - δ, where f_a is the true frequency of a. (A minimal in-RAM sketch of these two operations is given below.)
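To make the two operations concrete, the following is a minimal in-RAM count-min sketch in C++. It is an illustrative sketch only, not the implementation evaluated in this paper; the hash mixing (std::hash combined with a per-row seed) merely stands in for the murmurhash functions we use later.

#include <algorithm>
#include <cmath>
#include <cstdint>
#include <functional>
#include <vector>

// Minimal in-RAM count-min sketch: r rows, c columns, one hash per row.
class CountMinSketch {
public:
    CountMinSketch(double eps, double delta)
        : c_(static_cast<size_t>(std::ceil(std::exp(1.0) / eps))),
          r_(static_cast<size_t>(std::ceil(std::log(1.0 / delta)))),
          table_(r_, std::vector<uint64_t>(c_, 0)) {}

    // UPDATE(a, count): add `count` to one cell in every row.
    void update(uint64_t a, uint64_t count = 1) {
        for (size_t i = 0; i < r_; ++i)
            table_[i][hash(a, i)] += count;
    }

    // ESTIMATE(a): minimum over the r cells; never an underestimate.
    uint64_t estimate(uint64_t a) const {
        uint64_t est = UINT64_MAX;
        for (size_t i = 0; i < r_; ++i)
            est = std::min(est, table_[i][hash(a, i)]);
        return est;
    }

private:
    // Stand-in for r independent hash functions (the row index acts as a seed).
    size_t hash(uint64_t a, size_t row) const {
        return std::hash<uint64_t>{}(a ^ (0x9e3779b97f4a7c15ULL * (row + 1))) % c_;
    }

    size_t c_, r_;
    std::vector<std::vector<uint64_t>> table_;
};

For example, CountMinSketch cms(1e-4, 0.01); cms.update(42); cms.estimate(42); returns at least 1, and never underestimates.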

3 Buffered Count-Min Sketch

In this section, we describe the Buffered Count-Min Sketch, an adaptation of the CMS to SSD. The traditional CMS, when placed on external storage, exhibits performance issues due to the random-write nature of hashing. Each update operation in the CMS requires r writes to different rows and columns of the CMS. On a large data structure, these writes end up destined to different pages on disk, causing the update to perform r random SSD page writes. For high-precision CMS scenarios where δ is small (e.g., between 0.001 and 0.01), this can be between 5-7 writes to SSD, which is unacceptable in a high-throughput scenario.

To solve this problem, we implement, analyze and empirically test the data structure presented in [15] that outlines three adaptations to the original data structure:

  1. Partitioning the CMS into pages and column-first layout: We logically divide the CMS on SSD into pages of block size B. A CMS with r rows, c columns, and counters of w bits each, for a total of crw bits, contains pages P_1, P_2, ..., P_k, where k = ⌈crw/B⌉ and each page spans c/k contiguous columns: P_i spans columns (i-1)c/k through ic/k - 1. To improve cache-efficiency, the CMS is laid out on disk in column-first fashion, which allows each logical page to be laid out sequentially in memory. Thus, each read/write of a logical page requires O(1) I/Os.

  2. Hash localization: We direct all hashes of each element to a single logical page of the CMS, determined by an additional hash function h_0. The subsequent hash functions h_1, ..., h_r map to the columns inside the corresponding logical page, i.e., the range of h_i for an element x is [h_0(x)c/k, (h_0(x)+1)c/k - 1]. This way, we direct all updates and reads related to one element to one logical page.

  3. Buffering: When an update operation occurs, the hashes produced for an element are first stored inside an in-memory buffer. The buffer is partitioned into k sub-buffers B_1, ..., B_k of equal size, which directly correspond to the logical pages on disk in that B_i stores the hashes for updates destined for page P_i. Each element is first hashed using h_0, which determines the sub-buffer in which the hashes for this element are temporarily stored. Once a sub-buffer becomes full, we read the corresponding page from the CMS, apply all updates destined for that page, and write it back to disk. With M bits of buffer memory, the capacity of a sub-buffer is M/(kw) hashes, which is equivalent to M/(kwr) elements, so the amortized cost of an update becomes kwr/M I/Os. (A worked example of this cost follows below.)
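To make the amortized bound concrete, here is a back-of-the-envelope calculation with hypothetical but representative values (illustrative only, not the configuration used in our experiments): a 1GB sketch split into 4KB logical pages gives k = 2^18 pages; storing each hash in w = 32 bits, with r = 5 rows and an M = 64MB = 2^29-bit buffer, yields

    kwr/M = (2^18 * 32 * 5) / 2^29 = 5/64 ≈ 0.078,

i.e., each sub-buffer holds M/(kwr) ≈ 12.8 elements, so one page read-modify-write is paid roughly once every 13 updates, well below one I/O per update.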

Figure 2: UPDATE operation on the Buffered Count-Min Sketch. Updates are stored in RAM; all updates for a given element are destined for the same block on disk.
Require: key, r
subbufferIndex := murmur_0(key);
for i := 1 to r do
    hashes[i] := murmur_i(key);
end for
AppendToBuffer(hashes, subbufferIndex);

if isSubbufferFull(subbufferIndex) then
    bcmsBlock := readDiskPage(subbufferIndex);
    pageStart := calculatePageStart(subbufferIndex);
    for each entry in Subbuffer[subbufferIndex] do
        for each index in entry do
            offset := pageStart + entry[index];
            bcmsBlock[offset][index]++;
        end for
    end for
    writeBcmsPageBackToDisk(bcmsBlock);
    clearBuffer(subbufferIndex);
end if
Algorithm 1: Buffered Count-Min Sketch - UPDATE function

The pseudocode for UPDATE() is shown in Algorithm 1, and for ESTIMATE() in Algorithm 2. We use murmurhash as our hashing algorithm due to its efficiency and simplicity [2]. Unlike UPDATE(), the ESTIMATE() operation is not buffered. A related work [10] that implements a buffered Bloom filter on SSD buffers lookups as well. In the count-min sketch scenario, however, buffering ESTIMATE() is unproductive: even if the item is found in the buffer, we still need to check the CMS page to obtain the correct count. Therefore, our ESTIMATE() is optimized for the worst-case single lookup and works both for insert-only/lookup-only and for mixed workloads. ESTIMATE() also first computes the correct sub-buffer using h_0 and flushes that sub-buffer to the corresponding SSD page in case some updates are pending. Once it applies the necessary changes to the page, it reads the CMS cells specified by the hashes and returns the minimum value.

Require: key, r
subbufferIndex := murmur_0(key);
pageStart := calculatePageStart(subbufferIndex);
bcmsBlock := readDiskPage(subbufferIndex);

if isSubbufferNotEmpty(subbufferIndex) then
    for each entry in Subbuffer[subbufferIndex] do
        for each index in entry do
            offset := pageStart + entry[index];
            bcmsBlock[offset][index]++;
        end for
    end for
    clearBuffer(subbufferIndex);
end if

for i := 1 to r do
    value := murmur_i(key);
    offset := pageStart + value;
    estimates[i] := bcmsBlock[offset][i - 1];
end for
writeBcmsPageBackToDisk(bcmsBlock);
return min(estimates)
Algorithm 2: Buffered Count-Min Sketch - ESTIMATE function
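For concreteness, the following C++ sketch mirrors Algorithms 1 and 2 in a single class. It is a simplified, hypothetical rendering, not our benchmarked implementation: the on-SSD pages are modeled as an in-memory vector of pages (a real implementation would mmap a file, as described in Section 5), and the murmur_i family is stood in for by seeded std::hash calls.

#include <algorithm>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Simplified buffered count-min sketch mirroring Algorithms 1 and 2.
// Pages are modeled in memory; flush() marks where the SSD read-modify-write would occur.
class BufferedCMS {
public:
    BufferedCMS(size_t pages, size_t colsPerPage, size_t rows, size_t bufferPerPage)
        : k_(pages), cols_(colsPerPage), r_(rows), cap_(bufferPerPage),
          disk_(pages, std::vector<uint64_t>(colsPerPage * rows, 0)),
          buffer_(pages) {}

    // UPDATE: append the r column hashes to the element's sub-buffer;
    // flush the sub-buffer to its page when it reaches capacity.
    void update(uint64_t key) {
        size_t page = hash(key, 0) % k_;          // murmur_0: choose the logical page
        std::vector<size_t> hashes(r_);
        for (size_t i = 0; i < r_; ++i)
            hashes[i] = hash(key, i + 1) % cols_; // murmur_i: column inside the page
        buffer_[page].push_back(std::move(hashes));
        if (buffer_[page].size() >= cap_) flush(page);
    }

    // ESTIMATE: flush pending updates for the element's page, then take the
    // minimum over the r cells of that page.
    uint64_t estimate(uint64_t key) {
        size_t page = hash(key, 0) % k_;
        if (!buffer_[page].empty()) flush(page);
        const std::vector<uint64_t>& block = disk_[page];
        uint64_t est = UINT64_MAX;
        for (size_t i = 0; i < r_; ++i) {
            size_t col = hash(key, i + 1) % cols_;
            est = std::min(est, block[col * r_ + i]); // column-first layout inside the page
        }
        return est;
    }

private:
    void flush(size_t page) {
        std::vector<uint64_t>& block = disk_[page]; // readDiskPage
        for (const auto& entry : buffer_[page])
            for (size_t i = 0; i < r_; ++i)
                ++block[entry[i] * r_ + i];         // bcmsBlock[offset][index]++
        buffer_[page].clear();                      // writeBcmsPageBackToDisk + clearBuffer
    }

    // Stand-in for the murmur_i family (seed derived from i).
    size_t hash(uint64_t key, size_t i) const {
        return std::hash<uint64_t>{}(key ^ (0x9e3779b97f4a7c15ULL * (i + 1)));
    }

    size_t k_, cols_, r_, cap_;
    std::vector<std::vector<uint64_t>> disk_;              // k pages, each cols*r counters
    std::vector<std::vector<std::vector<size_t>>> buffer_; // k sub-buffers of hash tuples
};

The flush corresponds to the read-modify-write of one SSD page in Algorithms 1 and 2; with M bits of buffer split across k sub-buffers and wr bits of hashes per element, it is triggered roughly once every M/(kwr) updates.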

4 Analysis of Buffered Count-Min Sketch

In this section, we show that buffering and hash localization do not substantially degrade the error guarantee of the buffered count-min data structure. Fix a failure probability δ and let ε = ε(N) be the function of N controlling the estimation error. Let c = ⌈e/ε⌉ and r = ⌈ln(1/δ)⌉. The traditional count-min sketch uses cr counters/words of space. Recall that for our purposes, ε is sub-constant, so the sketch does not fit in RAM.

Let k be the number of blocks occupied by the buffered count-min sketch. We assume each counter/word occupies w bits and a block holds B bits (i.e., B/w counters), so k = ⌈crw/B⌉; M denotes the number of bits of main memory available for the buffer. Our analysis will assume the following mild conditions:

Assumption 1: N is sufficiently larger than the number of blocks k; N = ω(k ln^3 k) suffices. Since k depends inversely on ε, this assumption essentially means that the allowed error εN grows at least polylogarithmically in N. Assumption 2: the number of blocks is large compared to 1/δ, i.e., 1/k = o(δ).

Both conditions are satisfied, e.g., when δ is a constant and ε = 1/polylog(N) or ε = 1/N^α for any constant α < 1.

For brevity, we will drop the dependence of ε on N and write the error rate as just ε; however, it is important to note that ε is not a constant.

The buffered count-min sketch is a data structure that uses k blocks of space on SSD and, for any query q,

  • returns ESTIMATE(q) in 1 I/O and performs updates in kwr/M I/Os amortized, or, if kwr/M > 1, in one I/O worst case.

  • Let Error(q) = ESTIMATE(q) - TrueFrequency(q). Then for any q,

      Pr[Error(q) ≥ εN(1 + o(1))] ≤ δ + o(δ).

Remark: By Assumption 1, the multiplicative term hidden in the o(1) is small (in fact, it is O(sqrt(k ln k / N))). By Assumption 2, the additive o(δ) term in the failure probability is negligible compared to δ. Thus we claim that the buffered count-min sketch gives almost the same guarantees as a traditional count-min sketch, while obtaining a factor-r speedup on queries. The guarantee that estimates take 1 I/O is apparent from the construction, as only one block needs to be loaded. (In practice, we may sometimes need 2 I/Os due to block-page alignment, but never more than 2.)

The proof is a combination of the classical analysis of the CMS and the maximum load of balls in bins when the number of bins is much smaller than the number of balls. Also, note that unlike in the traditional CMS, the errors for a query q in different rows are no longer independent (in fact, they are positively correlated: a high error in one row implies that more elements were hashed by h_0 to the same bucket as q).

The hash function h_0 maps elements into k buckets, each of size B (and so we will also call them blocks). Each bucket can be thought of as an r x c/k matrix. Note that each bucket holds cr/k counters, and k = ⌈crw/B⌉. We assume that h_0 is a perfectly random hash function and, abusing notation, identify a bucket/block with a bin, where h_0 assigns elements (balls) to one of the k buckets (bins).

In this scenario we use Lemma 2(b) from [24] and adapt it to our setting.

Lemma 1 (adapted from Lemma 2(b) in [24]). Let Bin(n, p) denote a Binomial distribution with parameters n and p, and let h = h(n) be a function with h = o((np)^(2/3)). If np tends to infinity as n tends to infinity, then

  Pr[Bin(n, p) ≥ np + h] ≤ e^(-h^2/(2np)) (1 + o(1)).

Let L_max denote the maximum number of elements that fall into any bucket when the N elements are hashed by h_0.

Lemma 2. Let L = N/k + 2 sqrt((N/k) ln k) and suppose Assumptions 1 and 2 hold. Then

  Pr[L_max > L] ≤ (1 + o(1))/k = o(δ).

Proof.

We first check that the deviation term h = 2 sqrt((N/k) ln k) satisfies the conditions of Lemma 1. Since h_0 is uniform, each bucket is equally probable, so the number of elements in a fixed bucket follows a Bin(N, 1/k) distribution with mean N/k. We need to check that the extra term in L, namely h, is o((N/k)^(2/3)); this is precisely the condition that N = ω(k ln^3 k) (Assumption 1).

Next we apply Lemma 1 with n = N and p = 1/k. By Assumption 1, k ln^3 k / N goes to zero as N goes to infinity, and so N/k = np goes to infinity as N goes to infinity. Thus the probability that the number of elements in any particular bucket exceeds L = N/k + h is at most e^(-h^2/(2N/k)) (1 + o(1)). Putting in h = 2 sqrt((N/k) ln k), we get h^2/(2N/k) = 2 ln k, and thus the probability is at most (1 + o(1))/k^2.

Thus the probability that the maximum number of balls in a bin is more than L is bounded (by the union bound over the k bins) by (1 + o(1))/k, which is o(δ) by Assumption 2, and the lemma is proved. ∎

Now that we know that, with probability at least 1 - o(δ), no bucket has more than L elements, we observe that a bucket serves as a "mini" CMS for the elements that hash to it. In other words, let S_q be the number of elements that hash to the same bucket as q under h_0. The expected error in the i-th row of the mini-CMS for q (the row-i entry for q is contained inside the bucket of q) is at most S_q k/c, since each row of the bucket has c/k cells.

By Markov's inequality, Pr[error of q in row i ≥ e S_q k/c] ≤ 1/e.

Let E = e L k/c. We now compute the bound on the final error (after taking the min over the r rows) as follows:

  Pr[Error(q) ≥ E] ≤ Pr[Error(q) ≥ E | S_q ≤ L] + Pr[S_q > L]
                  ≤ Pr[error of q in every row i ≥ e S_q k/c | S_q ≤ L] + Pr[L_max > L]
                  ≤ (1/e)^r + o(δ)
                  ≤ δ + o(δ),

where the second-to-last inequality follows from Markov's inequality applied to each row (conditioned on the set of elements in the bucket of q, the errors in different rows are independent) and Lemma 2. Finally, by observing that E = e L k/c = εN(1 + o(1)) (since c = ⌈e/ε⌉ and, by Assumption 1, L = (N/k)(1 + o(1))), the proof of the theorem is complete.

5 Evaluation

In this section, we evaluate our implementation of the buffered count-min sketch. We compare the buffered count-min sketch against the (traditional) count-min sketch. We evaluate each data structure on two fundamental operations, insertions and queries. We evaluate queries for a set of elements chosen uniformly at random.

In our evaluation, we address the following questions about how the performance of buffered count-min sketch compares to the count-min sketch:

  • How does the insertion throughput in buffered count-min sketch compare to count-min sketch on SSD?

  • How does the query throughput in buffered count-min sketch compare to count-min sketch on SSD?

  • How does the hash localization in buffered count-min sketch affect the overestimates compared to the overestimates in count-min sketch?

5.1 Experimental setup

To answer the above questions, we evaluate the performance of the buffered count-min sketch and the count-min sketch on SSD by scaling the sketch out of RAM. For the SSD benchmarks, we use four different RAM-size-to-sketch-size ratios: 1/2, 1/4, 1/8, and 1/16. The RAM-size-to-sketch-size ratio is the ratio of the size of the available RAM to the size of the sketch on SSD. We fix the size of the available RAM to be 64MB and change the size of the sketch to change the RAM-size-to-sketch-size ratio. The page size in all our benchmarks was set to 4096B. In all the benchmarks, we measure the throughput (operations per second) to evaluate the insertion and query performance.

To measure the insertion throughput, we first calculate the number of elements we can insert in the sketch using the calculations described in Section 5.2. During an insert operation, we first generate an integer from a uniform-random distribution and then add that integer to the sketch. This way, we do not use any extra memory to store the set of integers to be added to the sketch. We then measure the total time taken to insert the given set of elements in the sketch. Note that for the buffered count-min sketch, we make sure to flush all the remaining inserts from the buffer to the sketch on SSD at the end and include the time to do that in the total time.

To measure the query throughput, we query for elements drawn from a uniform-random distribution and measure the throughput. The reason for the query benchmark is to simulate a real-world query workload where some elements may not be present in the sketch and the query will terminate early thereby requiring fewer I/Os.

For all the query benchmarks, we first perform the insertion benchmark and write the sketch to SSD. After the insertion benchmark, we flush all caches (page cache, directory entries, and inodes). We then map the sketch back into RAM and perform queries on the sketch. This way we make sure that the sketch is not already cached in kernel caches from the insertion benchmark.

We compare the overestimates in buffered count-min sketch and count-min sketch for all the four sketch sizes for which we perform insertion and query benchmarks. To measure the overestimates, we first perform the insertion benchmark. However, during the insertion benchmark, we also store each inserted element in a multiset. Once insertions are done, we iterate over the multiset and query for each element in the sketch. We then take the difference of the count returned from the sketch and the actual count of the element to calculate the overestimate.

For the SSD-based experiments, we allocate space for the sketch by mmap-ing it to a file on SSD. We then control the RAM available to the benchmarking process using cgroups. We fix the RAM size for all the experiments to be 64MB. We then increase the size of the sketch based on the RAM-size-to-sketch-size ratio of the particular experiment. For the buffered count-min sketch, we use all the available RAM as the buffer. Paging is handled by the operating system based on the disk accesses. The point of these experiments is to evaluate the I/O efficiency of sketch operations.
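The following is a minimal sketch of this setup, assuming hypothetical file and cgroup paths; it illustrates the mechanism (an mmap-backed sketch plus a cgroup memory limit), not our exact benchmarking harness.

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cstddef>
#include <cstdint>

// Map a sketch of `bytes` bytes onto a file on SSD. The OS pages parts of the
// sketch in and out of RAM; a cgroup memory limit caps how much stays resident,
// e.g. (cgroup v1, hypothetical group name):
//   echo 67108864 > /sys/fs/cgroup/memory/cms_bench/memory.limit_in_bytes
uint64_t* map_sketch(const char* path, size_t bytes) {
    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0) return nullptr;
    if (ftruncate(fd, static_cast<off_t>(bytes)) != 0) { close(fd); return nullptr; }
    void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);  // the mapping stays valid after closing the descriptor
    return p == MAP_FAILED ? nullptr : static_cast<uint64_t*>(p);
}

Pages of the mapped file are brought into RAM on demand by the kernel, and the cgroup limit caps the resident set, playing the role of the in-memory budget M from the analysis.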

All benchmarks were performed on 64-bit Ubuntu 16.04 running Linux kernel 4.4.0-98-generic. The machine has an Intel Skylake CPU (Core(TM) i7-6700HQ CPU @ 2.60GHz with 4 cores and 6MB L3 cache) and a Toshiba SSD.

Size Width Depth #elements
128MB 3355444 5 9875188
256MB 6710887 5 19750377
512MB 13421773 5 39500754
1GB 26843546 5 79001508
Table 1: Size, width, and depth of the sketch and the number of elements inserted in count-min sketch and buffered count-min sketch in our benchmarks (insertion, query, and overestimate calculation).

5.2 Configuring the sketch

In our benchmarks, we take as input δ, the maximum overestimate (εN), and the size of the sketch, and configure the sketch as follows. The depth of the count-min sketch is ⌈ln(1/δ)⌉. The number of cells is the sketch size divided by the counter size (8 bytes). And the width of the count-min sketch is the number of cells divided by the depth.

Given these parameters, we calculate the number of elements to be inserted in the sketch as N = overestimate/ε, where ε = e/width. In all our experiments, we fix δ so that the depth is 5, fix the maximum overestimate at 8, and change the sketch size. Table 1 shows the dimensions of the sketch and the number of elements inserted for each sketch size. (The short calculation below reproduces these values.)
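The following short C++ program reproduces the numbers in Table 1 from the inputs described above; the 8-byte counter size and the maximum overestimate of 8 are assumptions on our part that are consistent with the table.

#include <cmath>
#include <cstdint>
#include <cstdio>

int main() {
    const double delta = 0.01;        // assumed failure probability (gives depth 5)
    const double overestimate = 8.0;  // assumed maximum overestimate, consistent with Table 1
    const double counterBytes = 8.0;  // assumed 8-byte counters, consistent with Table 1
    const double sizesMB[] = {128, 256, 512, 1024};

    for (double mb : sizesMB) {
        double bytes = mb * 1024 * 1024;
        int depth = static_cast<int>(std::ceil(std::log(1.0 / delta)));  // = 5
        double cells = bytes / counterBytes;
        double width = cells / depth;                 // columns per row
        double eps = std::exp(1.0) / width;           // since width = e / eps
        uint64_t n = static_cast<uint64_t>(overestimate / eps);  // elements to insert
        std::printf("%6.0fMB  width=%.0f  depth=%d  elements=%llu\n",
                    mb, std::ceil(width), depth, static_cast<unsigned long long>(n));
    }
    return 0;
}

Running this should print rows matching Table 1 (e.g., width 3355444 and 9875188 elements for the 128MB sketch).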

Figure 3: Insert throughput of the count-min sketch and the buffered count-min sketch with increasing sizes. The available RAM is fixed to 64MB. With increasing sketch sizes (on the x-axis), the RAM-size-to-sketch-size ratio decreases: 1/2, 1/4, 1/8, and 1/16. (Higher is better)
Figure 4: Query throughput of the count-min sketch and the buffered count-min sketch with increasing sizes. The available RAM is fixed to 64MB. With increasing sketch sizes (on the x-axis), the RAM-size-to-sketch-size ratio decreases: 1/2, 1/4, 1/8, and 1/16. (Higher is better)
Figure 5: Maximum overestimate reported by the count-min sketch and the buffered count-min sketch for any inserted element, for different sketch sizes. The blue line represents the average overestimate reported by the count-min sketch and the buffered count-min sketch over all inserted elements. The average overestimate is the same for both the count-min sketch and the buffered count-min sketch.

5.3 Insert Performance

Figure 3 shows the insert throughput of the count-min sketch and the buffered count-min sketch with changing RAM-size-to-sketch-size ratios. The buffered count-min sketch is 3.7x-4.7x faster than the count-min sketch in terms of insert throughput on SSD.

The buffered count-min sketch performs less than one I/O per insert operation because all the hashes for a given element are localized to a single page on SSD. In the count-min sketch, by contrast, the hashes for a given element are spread across the whole sketch. Therefore, the insert throughput of the buffered count-min sketch is 3.7x higher already when the sketch is twice the size of the RAM, and the difference in throughput increases as the sketch gets bigger while the RAM size stays the same.

5.4 Query Performance

Figure 4 shows the query throughput of the count-min sketch and the buffered count-min sketch with changing RAM-size-to-sketch-size ratios. The buffered count-min sketch is 4.3x faster than the count-min sketch in terms of query throughput on SSD.

The buffered count-min sketch performs a single I/O per query operation because all the hashes for a given element are localized to a single page on SSD. In comparison, the count-min sketch may have to perform as many as r I/Os per query operation, where r is the depth of the count-min sketch.

5.5 Overestimates

In Figure 5 we empirically compare the overestimates returned by the count-min sketch and the buffered count-min sketch for all four sketch sizes for which we performed the insert and query benchmarks. We found that the average and the maximum overestimate returned by the count-min sketch and the buffered count-min sketch are exactly the same. This shows that, empirically, hash localization in the buffered count-min sketch does not have any major effect on the overestimates.

6 Conclusion

In this paper we described the design and implementation of the buffered count-min sketch, and empirically showed that our implementation achieves a 3.7x-4.7x speedup on update (insert) and a 4.3x speedup on estimate (lookup) operations. Queries take 1 I/O, which is optimal in the worst case if we are not allowed to buffer queries. However, we do not know whether the update time is optimal. To the best of our knowledge, no lower bounds on the update time of such a data structure are known (the only known lower bounds are on space, e.g., in [17]). We leave the question of deriving update lower bounds and/or an SSD-based data structure with faster update time for future work.

References

  • [1] Alok Aggarwal and S. Vitter, Jeffrey. The input/output complexity of sorting and related problems. Commun. ACM, 31(9):1116–1127, September 1988. URL: http://doi.acm.org/10.1145/48529.48535, doi:10.1145/48529.48535.
  • [2] Austin Appleby. 32-bit variant of murmurhash3, 2011. URL: https://sites.google.com/site/murmurhash/.
  • [3] Brian Babcock, Shivnath Babu, Mayur Datar, Rajeev Motwani, and Jennifer Widom. Models and issues in data stream systems. In Proceedings of the Twenty-first ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS ’02, pages 1–16, New York, NY, USA, 2002. ACM. URL: http://doi.acm.org/10.1145/543613.543615, doi:10.1145/543613.543615.
  • [4] Brian Babcock, Shivnath Babu, Mayur Datar, Rajeev Motwani, and Jennifer Widom. Models and issues in data stream systems. In Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pages 1–16. ACM, 2002.
  • [5] Michael A. Bender, Martin Farach-Colton, Rob Johnson, Russell Kraner, Bradley C. Kuszmaul, Dzejla Medjedovic, Pablo Montes, Pradeep Shetty, Richard P. Spillane, and Erez Zadok. Don’t thrash: How to cache your hash on flash. Proc. VLDB Endow., 5(11):1627–1637, July 2012. URL: http://dx.doi.org/10.14778/2350229.2350275, doi:10.14778/2350229.2350275.
  • [6] Burton H. Bloom. Space/time trade-offs in hash coding with allowable errors. Commun. ACM, 13(7):422–426, July 1970. URL: http://doi.acm.org/10.1145/362686.362692, doi:10.1145/362686.362692.
  • [7] Flavio Bonomi, Michael Mitzenmacher, Rina Panigrahy, Sushil Singh, and George Varghese. An improved construction for counting bloom filters. In Proceedings of the 14th Conference on Annual European Symposium - Volume 14, ESA’06, pages 684–695, London, UK, UK, 2006. Springer-Verlag. URL: http://dx.doi.org/10.1007/11841036_61, doi:10.1007/11841036_61.
  • [8] Lee Breslau, Pei Cao, Li Fan, Graham Phillips, and Scott Shenker. Web caching and zipf-like distributions: Evidence and implications. In INFOCOM ’99. Eighteenth Annual Joint Conference of the IEEE Computer and Communications Societies, pages 126–134, 1999.
  • [9] Andrei Broder and Michael Mitzenmacher. Network applications of bloom filters: A survey. In Internet Mathematics, pages 636–646, 2002.
  • [10] Mustafa Canim, George A. Mihaila, Bishwaranjan Bhattacharjee, Christian A. Lang, and Kenneth A. Ross. Buffered bloom filters on solid state storage. In Rajesh Bordawekar and Christian A. Lang, editors, ADMS@VLDB, pages 1–8, 2010. URL: http://dblp.uni-trier.de/db/conf/vldb/adms2010.html#CanimMBLR10.
  • [11] Aiyou Chen, Yu Jin, Jin Cao, and Li Erran Li. Tracking long duration flows in network traffic. In Proceedings of the 29th Conference on Information Communications, INFOCOM’10, pages 206–210, Piscataway, NJ, USA, 2010. IEEE Press. URL: http://dl.acm.org/citation.cfm?id=1833515.1833557.
  • [12] Graham Cormode and S. Muthukrishnan. An improved data stream summary: The count-min sketch and its applications. J. Algorithms, 55(1):58–75, April 2005. URL: http://dx.doi.org/10.1016/j.jalgor.2003.12.001, doi:10.1016/j.jalgor.2003.12.001.
  • [13] Biplob Debnath, Sudipta Sengupta, Jin Li, David J. Lilja, and David H. C. Du. Bloomflash: Bloom filter on flash-based storage. In Proceedings of the 2011 31st International Conference on Distributed Computing Systems, ICDCS ’11, pages 635–644, Washington, DC, USA, 2011. IEEE Computer Society. URL: http://dx.doi.org/10.1109/ICDCS.2011.44, doi:10.1109/ICDCS.2011.44.
  • [14] Sourav Dutta, Ankur Narang, and Suman K. Bera. Streaming quotient filter: A near optimal approximate duplicate detection approach for data streams. Proc. VLDB Endow., 6(8):589–600, June 2013. URL: http://dx.doi.org/10.14778/2536354.2536359, doi:10.14778/2536354.2536359.
  • [15] Ehsan Eydi, Dzejla Medjedovic, Emina Mekic, and Elmedin Selmanovic. Buffered count-min sketch. In Mirsad Hadžikadić and Samir Avdaković, editors, Advanced Technologies, Systems, and Applications II, pages 249–255, Cham, 2018. Springer International Publishing.
  • [16] Mohamed Medhat Gaber, Arkady Zaslavsky, and Shonali Krishnaswamy. Mining data streams: A review. SIGMOD Rec., 34(2):18–26, June 2005. URL: http://doi.acm.org/10.1145/1083784.1083789, doi:10.1145/1083784.1083789.
  • [17] Sumit Ganguly. Lower bounds on frequency estimation of data streams. In International Computer Science Symposium in Russia, pages 204–215. Springer, 2008.
  • [18] Michael Gertz, Quinn Hart, Carlos Rueda, Shefali Singhal, and Jie Zhang. A data and query model for streaming geospatial image data. In Torsten Grust, Hagen Höpfner, Arantza Illarramendi, Stefan Jablonski, Marco Mesiti, Sascha Müller, Paula-Lavinia Patranjan, Kai-Uwe Sattler, Myra Spiliopoulou, and Jef Wijsen, editors, Current Trends in Database Technology – EDBT 2006, pages 687–699, Berlin, Heidelberg, 2006. Springer Berlin Heidelberg.
  • [19] Amit Goyal, Jagadeesh Jagarlamudi, Hal Daumé, III, and Suresh Venkatasubramanian. Sketch techniques for scaling distributional similarity to the web. In Proceedings of the 2010 Workshop on GEometrical Models of Natural Language Semantics, GEMS ’10, pages 51–56, Stroudsburg, PA, USA, 2010. Association for Computational Linguistics. URL: http://dl.acm.org/citation.cfm?id=1870516.1870524.
  • [20] Gurmeet Singh Manku and Rajeev Motwani. Approximate frequency counts over data streams. In Proceedings of the 28th International Conference on Very Large Data Bases, VLDB ’02, pages 346–357. VLDB Endowment, 2002. URL: http://dl.acm.org/citation.cfm?id=1287369.1287400.
  • [21] Muthu Muthukrishnan and Graham Cormode. Approximating data with the count-min sketch. volume 29, pages 64–69, Los Alamitos, CA, USA, 10 2011. IEEE Computer Society Press. URL: doi.ieeecomputersociety.org/10.1109/MS.2011.127, doi:10.1109/MS.2011.127.
  • [22] Suman Nath, Phillip B. Gibbons, Srinivasan Seshan, and Zachary R. Anderson. Synopsis diffusion for robust aggregation in sensor networks. In Proceedings of the 2Nd International Conference on Embedded Networked Sensor Systems, SenSys ’04, pages 250–262, New York, NY, USA, 2004. ACM. URL: http://doi.acm.org/10.1145/1031495.1031525, doi:10.1145/1031495.1031525.
  • [23] Prashant Pandey, Michael A. Bender, Rob Johnson, and Robert Patro. A general-purpose counting filter: Making every bit count. In Proceedings of the 2017 ACM International Conference on Management of Data, SIGMOD Conference 2017, Chicago, IL, USA, May 14-19, 2017, pages 775–787, 2017. URL: http://doi.acm.org/10.1145/3035918.3035963, doi:10.1145/3035918.3035963.
  • [24] Martin Raab and Angelika Steger. “balls into bins”—a simple and tight analysis. Randomization and Approximation Techniques in Computer Science, pages 159–170, 1998.
  • [25] Tim Roughgarden and Gregory Valiant. Cs168: The modern algorithmic toolbox lecture #2: Approximate heavy hitters and the count-min sketch, 2018.
  • [26] Tamás Sarlós, Adrás A. Benczúr, Károly Csalogány, Dániel Fogaras, and Balázs Rácz. To randomize or not to randomize: Space optimal summaries for hyperlink analysis. In Proceedings of the 15th International Conference on World Wide Web, WWW ’06, pages 297–306, New York, NY, USA, 2006. ACM. URL: http://doi.acm.org/10.1145/1135777.1135823, doi:10.1145/1135777.1135823.
  • [27] Stuart Schechter, Cormac Herley, and Michael Mitzenmacher. Popularity is everything: A new approach to protecting passwords from statistical-guessing attacks. In Proceedings of the 5th USENIX Conference on Hot Topics in Security, HotSec’10, pages 1–8, Berkeley, CA, USA, 2010. USENIX Association. URL: http://dl.acm.org/citation.cfm?id=1924931.1924935.
  • [28] David P. Woodruff. New algorithms for heavy hitters in data streams. CoRR, abs/1603.01733, 2016. URL: http://arxiv.org/abs/1603.01733.
  • [29] Yin Zhang, Sumeet Singh, Subhabrata Sen, Nick Duffield, and Carsten Lund. Online identification of hierarchical heavy hitters: Algorithms, evaluation, and applications. In Proceedings of the 4th ACM SIGCOMM Conference on Internet Measurement, IMC ’04, pages 101–114, New York, NY, USA, 2004. ACM. URL: http://doi.acm.org/10.1145/1028788.1028802, doi:10.1145/1028788.1028802.
  • [30] Qi (George) Zhao, Mitsunori Ogihara, Haixun Wang, and Jun (Jim) Xu. Finding global icebergs over distributed data sets. In Proceedings of the Twenty-fifth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS ’06, pages 298–307, New York, NY, USA, 2006. ACM. URL: http://doi.acm.org/10.1145/1142351.1142394, doi:10.1145/1142351.1142394.