Quotient Hash Tables - Efficiently Detecting Duplicates in Streaming Data

01/14/2019 ∙ by Rémi Géraud, et al.

This article presents the Quotient Hash Table (QHT), a new data structure for duplicate detection in unbounded streams. QHTs stem from a corrected analysis of streaming quotient filters (SQFs), resulting in a 33% reduction in memory usage for equal performance. We provide a new and thorough analysis of both algorithms, with results of interest to other existing constructions. We also introduce an optimised version of our new data structure dubbed Queued QHT with Duplicates (QQHTD). Finally, we discuss the effect of adversarial inputs on hash-based duplicate filters similar to QHT.


1 Introduction and Motivation

We consider in this paper the following problem: given a possibly infinite stream of symbols, detect whether a given symbol appeared somewhere in the stream. Instances of this duplicate detection problem arise naturally in many applications: backup systems [11] or Web caches [10], search engine databases or click counting in web advertisement [16], retrieval algorithms [3], data stream management systems [1], or even cryptographic contexts where nonce reuse is problematic [15] (a nonce is a one-time-use random number).

It is generally not possible to store the whole stream in memory, so practical solutions to this problem must trade off memory for accuracy. The special case of a single duplicate has a known optimal solution [13, 14], but to the best of the authors’ knowledge no such result exists for detecting all duplicates in an unbounded stream.

Several algorithms were proposed to address this specific question, which we discuss below. Of particular interest to our investigation are Streaming Quotient Filters (SQFs), as described by Dutta et al. in [8]. We point out several crucial mistakes in the original analysis of SQF, and in doing so highlight that a more efficient data structure can be constructed along the same lines. We flesh out such a data structure, which we call the Quotient Hash Table (QHT), and provide a thorough analysis of both SQF and QHT. This analysis is our main contribution. QHT itself is not optimal and can be further improved: we describe such an improvement, dubbed Queued QHT with Duplicates (QQHTD), and benchmark our new algorithms against popular alternatives.

2 Preliminaries and related work

2.1 Duplicate Detection

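Let $E = (e_1, e_2, \dots)$ be a (possibly infinite) sequence of elements $e_i \in \Gamma$, and write $E_n = (e_1, \dots, e_n)$. An element $e_i$ from $E$ is a duplicate if there exists $j < i$ such that $e_j = e_i$. Otherwise, $e_i$ is unseen. (Note that by definition $e_1$ is always unseen.)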

The duplicate detection problem is the question of finding all duplicates in a stream. (Note that this problem is equivalent to a dynamic formulation of the approximate set membership problem.) We recall the following well-known result:

Theorem 2.1.

Assume that each element of the stream is sampled uniformly at random from an alphabet $\Gamma$. Then perfect detection requires $|\Gamma|$ bits of memory.

Proof.

A perfect duplicate filter must be able to store all streams, i.e. must be able to store any subset of the alphabet $\Gamma$; we denote the set of these subsets by $\mathcal{P}(\Gamma)$. Given that there are $2^{|\Gamma|}$ of them, according to information theory any such filter requires at least $\log_2 2^{|\Gamma|} = |\Gamma|$ bits of storage.

Because of this result, perfect duplicate detection is often out of reach when the alphabet is large. However, probabilistic solutions are sufficient for many applications. Such algorithms make errors: false positives (claiming a duplicate where there is none) and false negatives (missing a duplicate).

2.2 Filters

Definition 2.1 (Filter).

A filter over the memory is a tuple , where:

  • is the current state

Here DUPLICATE corresponds to a guess that the provided element is duplicate, and UNSEEN that it is unseen. Insert corresponds to an update of the filter’s memory state after observing a new element. Here models the amount of memory (states) available to the algorithm.

In practice, Detect and Insert are often merged into a single algorithm .

Definition 2.2 (False positive (resp. negative)).

If, for an unseen (resp. duplicate) element , outputs DUPLICATE (resp. UNSEEN), is called a false positive (resp. negative).

Definition 2.3 (FPR, FNR).

The false positive rate (FPR) of a stream is the frequency of false positives. The false negative rate (FNR) is similarly defined as the frequency of false negatives.

We are interested in filters whose FNR and FPR can be kept low when the available memory is bounded.

2.3 Hash-based filters

Filters rely on hashing to efficiently answer the duplicate detection problem. These filters are often variations on the well-known Bloom filter [2]. A Bloom filter uses an array of $m$ bits, initially all set to 0, together with $t$ hash functions $h_1, \dots, h_t$ mapping elements to positions in the array. An element $e$ is inserted by setting all the bits at positions $h_1(e), \dots, h_t(e)$ to 1. Detection of an element $e$ is the AND of all the cells at positions $h_1(e), \dots, h_t(e)$: if the AND value is 0, then emit UNSEEN, otherwise emit DUPLICATE.
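The Bloom filter described above can be sketched in a few lines of Python. This is only an illustration: the class name, the parameter names, and the way the hash functions are derived (one salted BLAKE2b hash per index) are assumptions of this sketch and are not taken from [2].

```python
import hashlib


class BloomFilter:
    """Plain Bloom filter used as a duplicate detector (no deletion)."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, element: bytes):
        # Assumption: the hash functions are obtained by salting one hash.
        for i in range(self.num_hashes):
            salt = i.to_bytes(8, "little")
            digest = hashlib.blake2b(element, salt=salt).digest()
            yield int.from_bytes(digest[:8], "little") % self.num_bits

    def stream(self, element: bytes) -> str:
        positions = list(self._positions(element))
        seen = all(self.bits[p] for p in positions)  # AND of the probed cells
        for p in positions:                          # insertion: set every cell to 1
            self.bits[p] = 1
        return "DUPLICATE" if seen else "UNSEEN"


bloom = BloomFilter(num_bits=1024, num_hashes=3)
first = bloom.stream(b"example")
second = bloom.stream(b"example")
```

Since bits are never reset, an element that was inserted is always reported as DUPLICATE afterwards, which is why this construction has no false negatives but a false positive rate that grows with the stream.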

It is easy to see that this approach has an FNR of 0. However, as the stream grows, the FPR gets worse, and in the limit of an infinite stream the FPR is 1. Many variants have been proposed in the literature to compensate for this effect [20], mostly by allowing deletion [5, 6], but doing so increases the FNR.

Alternatively, other filters prefer to store fingerprints: this is the rationale behind several recent constructions [8, 9].

For the data structures most relevant to our question, we refer the reader to the algorithms described in [7, 9, 19, 8, 22]. (Cuckoo filters [9] require a minimal adaptation for unbounded streams: in the original paper, failure is emitted after some number of relocations; we simply discard the failure. Theoretical analysis of this new structure is not addressed in this paper.)

Other data structures, such as [4, 12], are not considered in this paper, as they require an unbounded amount of memory as the stream grows.

2.4 Streaming Quotient Filter

One construction which we must describe at length is the Streaming Quotient Filter (SQF) [8]. Given an element, a certain fingerprint is stored in an array, at a row determined by the element and in one of the columns of that row. ([8] refers to fingerprints as “signatures”, but we shun this term to avoid any claim of cryptographic properties.) This array constitutes the filter’s state.

The filter’s construction uses integers , , and a hash function . The filter’s state is an array of rows and columns, each holding a -bit element (with ), or the special empty symbol . The filter’s state is initially in every cell.

Then, [8] describes as follows:

  • Compute . The most significant bits of define , and the least significant bits define .

  • Let be the Hamming weight of , and let be bits deterministically chosen666E.g., the least significant bits of . from . Let where denotes concatenation.

  • If is already stored in the row numbered , emit DUPLICATE.

  • Otherwise, store it in one empty cell of that row. If no empty cell exists in the row, store in one random cell of the row, replacing any fingerprint previously stored there. Emit UNSEEN.
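The steps above can be sketched as follows. All names (`q`, `r`, `r_reduced`, `k`, the use of SHA-256, the choice of 2^q rows) are assumptions made for this illustration, since the original notation is not available here; the fingerprint is kept as a pair rather than as a concatenated bit string.

```python
import hashlib
import random

EMPTY = None  # stands for the special empty symbol


class SQFSketch:
    """Illustrative streaming quotient filter; not the reference implementation of [8]."""

    def __init__(self, q: int, r: int, r_reduced: int, k: int, seed: int = 0) -> None:
        self.q, self.r, self.r_reduced = q, r, r_reduced
        self.rows = [[EMPTY] * k for _ in range(2 ** q)]  # assumed: 2^q rows, k columns
        self.rng = random.Random(seed)

    def _quotient_and_fingerprint(self, element: bytes):
        digest = hashlib.sha256(element).digest()
        h = int.from_bytes(digest, "big") >> (256 - self.q - self.r)  # keep q + r bits
        quotient = h >> self.r                       # most significant q bits
        remainder = h & ((1 << self.r) - 1)          # least significant r bits
        weight = bin(remainder).count("1")           # Hamming weight of the remainder
        reduced = remainder & ((1 << self.r_reduced) - 1)  # its least significant bits
        return quotient, (weight, reduced)           # pair stands for concatenation

    def stream(self, element: bytes) -> str:
        quotient, fingerprint = self._quotient_and_fingerprint(element)
        row = self.rows[quotient]
        if fingerprint in row:
            return "DUPLICATE"
        if EMPTY in row:
            slot = row.index(EMPTY)                  # use an empty cell if one exists
        else:
            slot = self.rng.randrange(len(row))      # otherwise overwrite a random cell
        row[slot] = fingerprint
        return "UNSEEN"


sqf = SQFSketch(q=4, r=8, r_reduced=2, k=2)
```

Note how the weight and the reduced remainder are both computed from the same remainder, which is the source of the redundancy discussed in the next section.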

3 Revisiting SQF: Quotient Hash Table

SQF is introduced and analysed in [8]. However, a careful reading reveals several crucial mistakes in that analysis. We focus here on two of them that directly impact the claim of SQF near-optimality. Note that here, as in the rest of this paper, hash functions are modelled as pseudo-random functions.

  1. Cells in the filter’s state are not independent: in particular, in every row, non-empty cells hold different values (by design);

  2. Terms in geometric sums of order cannot be neglected.

Taking these effects into account (see Sections 4.1 and 4.2), and using the same approximation as [8] (which stems from Catalan numbers; even though this is not mentioned in [8], the approximation is only asymptotically valid, but computations show that it is correct even for small parameter values, such as the ones used in practical applications), we get the following asymptotic formulae for the FNR and FPR

that disagree with [8]. Most importantly, the error rates do not decrease to 0, contrary to what was claimed. Interestingly, the parameters suggested in [8] indeed achieve an FNR of 0, but also an FPR of 1. (With these values, the distinct fingerprints that exist can all be stored in the cells of a row. When the filter is full, any duplicate will be reported as such, hence an FNR of 0. But every new element will also be reported as duplicate, hence an FPR of 1.)

Furthermore, there is redundancy between the Hamming weight of the remainder and the reduced remainder. For instance, if the reduced remainder contains at least one bit set to 1, we know that the Hamming weight is at least 1. Intuitively this means SQF is wasteful, and we could expect to avoid collisions in the filter’s state by using a better adjusted encoding: this intuition happens to be correct, as shown in Section 4.4.

Finally, since the state table contains rows, with cell each, the total memory required by an SQF is .

3.1 Full-size Hashing, Memory Adjustments

Our first observation is that SQF’s fingerprint scheme (a hash and a Hamming weight) can be fruitfully replaced by a single hash function of the same size. Not only does this simplify the theoretical analysis, it also makes much more efficient use of the available space. We also use more general hash functions, which give much more flexibility in adjusting the total memory of the filter. Combining these two changes, we obtain Algorithm 1, which we call the Quotient Hash Table (QHT).

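function Setup(filter parameters)
    Fix the number of rows, the number of buckets per row and the fingerprint size
    Choose a hash function mapping elements to row indices
    Choose a hash function mapping elements to fingerprints
    Let the state be an array with one fingerprint-sized cell per row and bucket, every cell initialized to the empty value
function Stream(e)
    for each cell in the row of e do
        if the cell holds the fingerprint of e then
            return DUPLICATE
    Let c be the first empty cell in the row of e
    if c does not exist then
        Let c be a random cell in the row of e
    Store the fingerprint of e in bucket c
    return UNSEEN
Algorithm 1 QHT Setup and Stream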
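A minimal Python sketch of Algorithm 1 may help to fix ideas. It is not the authors' implementation: the class and parameter names are hypothetical, both hash functions are derived from one SHA-256 digest, and the empty cell is encoded by reserving the fingerprint value 0 and mapping fingerprints directly into the non-zero range, which is a simplification of the options discussed in Section 3.2.

```python
import hashlib
import random

EMPTY = 0  # assumption: the fingerprint value 0 is reserved for empty cells


class QHT:
    """Illustrative Quotient Hash Table with random eviction."""

    def __init__(self, num_rows: int, buckets_per_row: int,
                 fingerprint_bits: int, seed: int = 0) -> None:
        self.num_rows = num_rows
        self.num_fingerprints = 2 ** fingerprint_bits - 1  # usable values 1 .. 2^bits - 1
        self.table = [[EMPTY] * buckets_per_row for _ in range(num_rows)]
        self.rng = random.Random(seed)

    def _row_and_fingerprint(self, element: bytes):
        digest = hashlib.sha256(element).digest()
        row = int.from_bytes(digest[:8], "big") % self.num_rows
        fingerprint = int.from_bytes(digest[8:16], "big") % self.num_fingerprints + 1
        return row, fingerprint

    def stream(self, element: bytes) -> str:
        row_index, fingerprint = self._row_and_fingerprint(element)
        row = self.table[row_index]
        if fingerprint in row:
            return "DUPLICATE"                   # nothing is inserted in this case
        if EMPTY in row:
            slot = row.index(EMPTY)              # first empty cell of the row
        else:
            slot = self.rng.randrange(len(row))  # evict a uniformly random cell
        row[slot] = fingerprint
        return "UNSEEN"


qht = QHT(num_rows=16, buckets_per_row=2, fingerprint_bits=8)
```

The memory footprint of this sketch is of course much larger than that of a bit-packed table; only the insertion and detection logic is meant to be representative.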

3.2 Empty Cells

SQF and QHT as described above make essential use of the “empty” cells in the table. The need for this feature is present for all fingerprint-based structures, including Cuckoo filters [9]. (Other constructions do not face this issue, including SBF [7] and b_DBF [19]: in these schemes, 0 always codes for absence.) However, a low-level implementation cannot rely on the availability of such a special value. Our options are to initialise all cells to 0 and either

  1. treat 0 as a fingerprint;

  2. if an element has a fingerprint of 0, reassign its fingerprint to 1;

  3. while an element has a fingerprint of 0, compute a new fingerprint based on some deterministic scheme.

The first option carries the risk of a high false positive rate, even for small streams: when a new element whose fingerprint is 0 should be stored in an empty bucket, it is instead dismissed as a duplicate. More specifically, before the filter is completely filled, any new element that draws the fingerprint 0 and lands in a row with an empty cell is a false positive, which leads to a high number of false positives at the beginning of any stream.

The second option is significantly faster than the third (which needs more than one hash computation per insertion on average, whereas the second option only needs 1); the third, on the other hand, has better statistical properties, making the analysis simpler. While the second option was preferred by [9] in their implementation (https://github.com/efficient/cuckoofilter/blob/aac6569cf30f0dfcf39edec1799fc3f8d6f594da/src/cuckoofilter.h), we retain the slower, but easier to analyse, option.

3.3 Semi-sorting

The technique of semi-sorting was introduced in [9] to shave some extra storage. The idea is as follows: treating empty cells as buckets containing the “0” fingerprint, for each row, sort the cells by their fingerprints, and then encode the result as a number.

As an example, for 4 buckets per row and 4-bit fingerprints there are only 3,876 possible sorted states, which can be encoded using 12 bits, as opposed to the 16 bits required to store four 4-bit fingerprints.
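The count of sorted states can be checked directly: a sorted row is a multiset of fingerprints, so the number of states is a binomial coefficient. The function name below is ours; the example assumes four buckets and 4-bit fingerprints, with the value 0 standing for an empty cell.

```python
from math import ceil, comb, log2


def sorted_row_states(buckets: int, fingerprint_bits: int) -> int:
    """Number of multisets of `buckets` fingerprints (0 included, for empty cells)."""
    values = 2 ** fingerprint_bits
    return comb(values + buckets - 1, buckets)


states = sorted_row_states(buckets=4, fingerprint_bits=4)
bits_semi_sorted = ceil(log2(states))  # bits needed to index one sorted state
bits_plain = 4 * 4                     # four 4-bit fingerprints stored as-is
```

With these values, `states` is 3,876 and `bits_semi_sorted` is 12, against 16 bits for the plain encoding, i.e. one bit saved per fingerprint.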

3.4 Comparison with Hash Tables

As the name implies, QHTs are related to hash tables. Indeed, a hash table is a QHT, wherein the number of rows is equal to . In particular, QHTs cannot perform worse than such structures, and bear similarities with so-called compaction techniques [21].

4 Error Rate Analysis

4.1 False Positive Rate

Consider a QHT with rows, buckets per row, and -bits fingerprints. For simplicity, further assume that no bucket is empty (which is true after some time), and that the stream is sampled uniformly at random from .

Theorem 4.1.

For a QHT of $N$ rows, $k$ buckets per row and $s$ possible fingerprints, as the number of insertions goes to infinity, the FPR goes to $k/s$. Moreover, the probability that an unseen element inserted after $n$ other elements triggers a false positive is $\mathrm{FP}_n = \frac{k}{s}\left(1 - \left(1 - \frac{1}{Nk}\right)^n\right)$.

Proof.

An element is a false positive if and only if has not been encountered yet, but . This event is triggered by the presence, in the filter, of another element with a hash and fingerprint colliding with those of . is called a false duplicate, and if is still in the filter when arrives, we refer to the event as a hard collision.

Our first remark is that the only false duplicate that may create a hard collision with is the last false duplicate inserted before arrives: let us assume that , are false duplicates, and that arrives before . When we insert in the filter, is either still in the filter or has been evicted.

  • If has been evicted, then will not hardly collide with .

  • If is still in the filter, then will be claimed as a duplicate and dismissed.

However, if we look at the table storing all fingerprints, dismissing is strictly equivalent to replacing by . Consequently, every false duplicate is erased by any new false duplicate, and only the last false duplicate can cause a hard collision. As a result, we will only focus on the probability that the last false duplicate (before arrives) causes a hard collision.

Let us assume that the last false duplicate appears at position of the stream , in other words, is the last false duplicate in the stream before .

Now, has to remain in the filter until arrives, even though new elements are added. Let us suppose that element , , does not evict from the filter. For to be evicted, the following conditions must be true:

  • is different from all the other fingerprints stored in the row (knowing that one of the buckets contains )

  • is inserted into the bucket in which is stored

Since we know that is the last false duplicate, we cannot simultaneously have and . As such, the first two conditions are not independent. Let be the probability that these two conditions are satisfied. Given that is the last false positive, there are only possibilities for the couple (. Moreover, among these states, only verify the first condition, and among these states, only verify the second condition. Finally, . If we assign the last event to the probability , we immediately get .

Finally, the probability of not being evicted by is and:

Now, has to avoid eviction by every element before arrives, i.e. by all elements , which happens with probability .

At that point, we know the probability that a hard collision happens when the last false duplicate has been inserted at position .

The probability of any element being a false duplicate is . In the stream , the last false duplicate is with probability ; it is with probability ; and with probability . The probability that the next element will result in a false positive after elements are inserted, is equal to the probability that the last false duplicate is not evicted. Thus the probability that a false positive happens after elements are inserted is

We now have the probability that a new element , inserted after insertions, is detected as a false positive.

Assuming the stream is of size , and noting its false positive rate, we get that is equal to the averages of all of its stream, i.e., . Given that and using Cesàro’s mean properties, we get, as one could have expected, . ∎
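The argument of the proof can be checked numerically. The sketch below follows our reading of the proof, with assumed symbol names: `N` rows, `k` buckets per row, `s` possible fingerprints, a false-duplicate probability of 1/(Ns) per element, and an eviction probability of (s − k)/(k(Ns − 1)) per subsequent element. Under these assumptions the sum over the position of the last false duplicate collapses to a closed form whose limit is k/s.

```python
def false_positive_probability(n: int, N: int, k: int, s: int) -> float:
    """Sum over the position i of the last false duplicate, as in the proof."""
    p_dup = 1 / (N * s)                    # an element is a false duplicate
    p_evict = (s - k) / (k * (N * s - 1))  # a later element evicts it
    q = (1 - p_dup) * (1 - p_evict)        # per-step survival factor
    return sum(p_dup * q ** (n - i) for i in range(1, n + 1))


def closed_form(n: int, N: int, k: int, s: int) -> float:
    """Geometric sum of the expression above; tends to k / s as n grows."""
    return (k / s) * (1 - (1 - 1 / (N * k)) ** n)


gap = abs(false_positive_probability(500, 64, 2, 8) - closed_form(500, 64, 2, 8))
limit = closed_form(10 ** 7, 64, 2, 8)
```

The simplification relies on the identity (1 − 1/(Ns))(1 − (s − k)/(k(Ns − 1))) = 1 − 1/(Nk), which is why the number of rows only affects how fast the limit is reached, not the limit itself.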

Thanks to the expression of , we can see that the more rows there are, the slower the FPR reaches its asymptotic (saturated) value.

A similar phenomenon can be observed with the Stable Bloom Filter [7]: adding rows will only decrease the FPR to a certain point. (In a Stable Bloom Filter, the FPR is bounded by an expression that depends on the number of rows and on several other filter parameters; a higher number of rows only decreases the FPR down to a certain point, but no further.) Another structure adapted to streaming data that we found, the block decaying Bloom Filter [19], operates on a sliding window and therefore does not use the same definition of a false positive as ours. (More precisely, their definition of a false positive is restricted to the sliding window: an element is a false positive if it is not already present in the sliding window, and yet is marked as a duplicate.)

4.1.1 Application to SQF

One might be tempted to directly apply this result to an SQF. However, as pointed out earlier, in an SQF fingerprints are correlated and therefore not equiprobable: depending on the parameters, one fingerprint may be produced by a single hash value whereas another is produced by two distinct hash values.

However, for an optimal SQF, fingerprints are equiprobable so the analysis above holds for optimal SQFs. In the general case, the authors of [8] have approximated the probability of fingerprint collision in an SQF with . Replacing with this probability in our analysis, we get an approximate asymptotic FPR of for SQFs, as announced in Section 3.

4.2 False Negative Rate

Theorem 4.2.

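For a QHT of $N$ rows, $k$ buckets per row, and $s$ different fingerprints, assuming that the stream contains more distinct elements than the filter is able to store, then as the number of insertions goes to infinity, the FNR goes to $1 - k/s$.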

Proof.

Assume that is a duplicate, we denote by the last element in the stream such that . Following the same reasoning as for the FPR, will trigger a false negative if and only if is removed from the filter before arrives, and if any false duplicate of , inserted between the removal of and , is deleted before arrives.

Let us assume that is deleted at time (this happens with probability ) by something else than a false duplicate. Using similar arguments than in the previous section, we have .

The probability that all false duplicates, inserted after time , are deleted before arrives is .

Let , and , and denote by the probability, in a stream where the last duplicate of is inserted at , that is deleted before arrives and no false positive persists until ,

Given that , we get:

The probability of to be a false negative is then , where is the probability that the last duplicate already seen is , so .

We obtain the probability that the th element of the stream will be a false negative:

For a stream of elements, the false negative rate is defined as the average error probability:

We show, in Appendix B, that for ,

A non-obvious consequence of the above expression for the FNR is that when the filter’s capacity increases, which corresponds to using more memory, it is possible to achieve an arbitrarily small error rate. (This fact is straightforward but requires substituting the intermediate terms by their full expressions before taking the limit.) At some point however, the memory is so large that every stream element can be stored, and there is no need for probabilistic data structures anymore. In practice we expect memory to be a limited resource, so this situation is unlikely to present itself.

Finally, assuming (or even ), i.e., that there are more distinct elements in the stream that what the filter is able to store, we get the limit rate of , which is . ∎

Note that $\mathrm{FPR}_\infty + \mathrm{FNR}_\infty = 1$.

The value of also gives (using the same corrections as for the FPR in Section 4.1.1) the FNR for SQF.

4.3 Error Rate and Filter Saturation

Claim 4.1.

The asymptotic relation $\mathrm{FPR}_\infty + \mathrm{FNR}_\infty = 1$ is universal (as long as the stream contains far more distinct elements than the filter can store) amongst hash-based duplicate detection filters and is the result of the filter’s saturation. Furthermore, when a filter reaches saturation, it behaves similarly, from the error rate point of view, to a filter answering DUPLICATE at random (we will call such filters random filters).

Proof.

In order to see this, first note that random filters always verify the relation $\mathrm{FPR} + \mathrm{FNR} = 1$. Given that a random filter returns DUPLICATE with a probability of $p$, an unseen element will be classified as DUPLICATE with probability $p$, and a duplicate will be classified as UNSEEN with probability $1 - p$, hence the result.

Now, on infinite streams with infinitely many different elements, filters are saturated with information. A filter can store at most one element per bit (cf. Section 2.1), so a filter of a given size in bits can remember at most that many elements. After a number of insertions far exceeding this capacity, the filter therefore remembers at most a tiny fraction of the stream. Having reached its saturated state, the filter has an extremely small probability of correctly guessing whether the incoming element is a duplicate or not, given what the filter actually knows about the stream. For this reason, the best strategy of a saturated filter is almost indistinguishable from a random strategy, i.e. randomly outputting DUPLICATE. When the stream grows indefinitely, the filter becomes asymptotically equivalent to a random filter. ∎
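The behaviour of a random filter is easy to simulate. In the sketch below, which is only an illustration, the rates are measured conditionally (false positives among unseen elements, false negatives among duplicates); this convention, the function name and the parameter `p` are assumptions of the example.

```python
import random


def random_filter_rates(p: float, stream, seed: int = 0):
    """Run a filter that answers DUPLICATE with probability p, ignoring its input."""
    rng = random.Random(seed)
    seen = set()  # ground truth only, never consulted by the filter
    false_pos = false_neg = unseen = duplicates = 0
    for element in stream:
        says_duplicate = rng.random() < p
        if element in seen:
            duplicates += 1
            false_neg += not says_duplicate
        else:
            unseen += 1
            false_pos += says_duplicate
            seen.add(element)
    return false_pos / unseen, false_neg / duplicates


source = random.Random(1)
stream = [source.randrange(50_000) for _ in range(100_000)]
fpr, fnr = random_filter_rates(0.3, stream)
```

Whatever the value of `p`, the measured FPR is close to `p` and the measured FNR is close to `1 - p`, so their sum stays close to 1.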

Furthermore, in practical cases, we observe that duplicate filters are equivalent to a specific kind of random filters: they answer DUPLICATE with some fixed probability $p$, depending on the filter’s nature and its parameters (i.e., $p$ does not change with time).

Interpretation

The interpretation of these results could suggest that streaming filters are useless: they need more memory than a random filter, despite being asymptotically equivalent. However, this is only true because of our hypotheses and definitions: we define a false negative to be a duplicate element claimed as unseen by the filter. However, after some amount of time, it is often acceptable that the element may be considered as unseen again: for instance a nonce is theoretically unique, but in practice after a reasonable amount of time nonce reuse is not a vulnerability. For this reason, adapting false positive and false negative to sliding windows may be relevant here. Moreover, we assumed that all elements of the stream had the same probability of occurrence. In practice, this hypothesis is not always correct, and as we will see in Section 6, filters operating on real data perform significantly better than random filters, and resist better to saturation.

4.4 Comparing QHT to SQF

Let us compare the memory required for a QHT to reach the same error rates as an SQF.

Given that both FPR and FNR depend only on , and , imposing the equality on these parameters ensures that both filters have exactly the same FPR and FNR.

Note that is not a user-chosen parameter, but rather a consequence of other parameters.

Theorem 4.3.

For exactly the same FPR and FNR, a QHT requires 33% less memory than an SQF.

The proof is made through the following subsections.

4.4.1 Deriving S from Filters Parameters

For a QHT, the number of possible fingerprints is derived from the number of fingerprint bits through a straightforward relation. (This relation changes on a system without the empty feature, see Section 3.2. For an SQF, one can just assign the empty value to one of the unassigned fingerprints.) For an SQF, however, the relation is more complicated.

When and are fixed, for any element , let be decomposed as , where is the -bits word used in the fingerprint, being the remaining bits of . For the Hamming weight function, we have . Yet . We know that is entirely dependent on , which is already used in the fingerprint. Thus, if we fix , there are only possible values for and thus for . Given that can have different values, we get that .

4.4.2 Comparing Required Memory

For QHTs, we have the relation . For SQF, the formula is rather (because fingerprints occupy bits).

Given that and , the ratio is:

Given that , we have

Using the recommended settings in [8], the ratio matches the 33% saving stated in Theorem 4.3, which concludes the proof. (Their optimal choice of parameters also sets the number of cells per row. However, the recommended values leave so few distinct fingerprints that they all fit in one row, so that this choice results in an FPR of 1: after some time the filter systematically responds DUPLICATE. Using fewer cells per row avoids this.)

4.5 Parameter Tuning

As noted in Section 4.2, no matter the choice of parameters we have $\mathrm{FPR}_\infty + \mathrm{FNR}_\infty = 1$: any particular choice of parameters is a trade-off between good FPR and good FNR performance, at least asymptotically. However, when the stream is small enough, one may choose parameters that maximally delay saturation.

We know that , so plugging this into the FPR formula gives

Which means that, for fixed and (i.e. for a fixed memory amount and a fixed asymptotic FPR), must be as small as possible in order to keep the FPR low for as long as possible.

For instance, assume that we want an asymptotic FPR of 25%. The potential values for the couple are , , and so on (leading respectively to ). Because of the above relation, we know that setting will yield the best saturation resistance for the FPR.
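The heuristic can be illustrated numerically under explicit assumptions: a memory model of N · k · log2(s) bits for N rows, k buckets per row and s possible fingerprints, and a false positive probability of (k/s)(1 − (1 − 1/(Nk))^n) after n insertions. Both the symbol names and the three example pairs are ours, chosen so that k/s = 25%.

```python
from math import log2


def fp_after(n: int, memory_bits: int, k: int, s: int) -> float:
    """False positive probability after n insertions, for a fixed memory budget."""
    rows = int(memory_bits // (k * log2(s)))  # assumed memory model: N * k * log2(s) bits
    return (k / s) * (1 - (1 - 1 / (rows * k)) ** n)


candidates = [(1, 4), (2, 8), (4, 16)]        # all with k / s = 25 %
values = [fp_after(100_000, 65_536, k, s) for k, s in candidates]
for (k, s), value in zip(candidates, values):
    print(f"k={k}, s={s}: {value:.4f}")
```

Under these assumptions the configuration with the fewest buckets per row stays furthest below the asymptotic 25% for a stream of 100,000 elements, which is the direction observed experimentally in Table 1.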

In order to test this heuristic, we compared several QHTs of approximately 65,536 bits, with the same asymptotic FPR but different parameter values. We took streams of 100,000 elements (well under the saturation value for this amount of memory), with each element selected uniformly at random from a fixed alphabet, leading to a stream with about 4.6% of duplicate elements. We averaged the results over 10 runs.

We observe in Table 1 that filters with a small value of do indeed perform better than filters with a bigger value of , which concludes the experiment.

FPR (%)         22.57   23.25   23.53   23.62   23.50
FNR (%)         35.89   44.24   50.77   54.55   58.73
FPR + FNR (%)   58.45   67.49   74.30   78.17   82.23
Table 1: Error rates of QHTs with the same asymptotic FPR

5 Further improvements: QQHTD

5.1 Keeping Track of Duplicates

In Algorithm 1, we do not insert anything if the element is detected as a duplicate. However, following [9]’s example, we can insert it anyway, resulting in a structure we call QHT with Duplicates, or QHTD. We briefly discuss its properties.

As we showed previously, the asymptotic FPR of a QHT is $k/s$, where $k$ is the number of buckets per row and $s$ the number of possible fingerprints. This was expected: each cell stores $k$ distinct fingerprints, so the probability that one of them matches the fingerprint of a unique element is logically $k/s$. Similarly, in QHTD each cell stores $k$ fingerprints, not necessarily distinct. The probability that at least one of these fingerprints is the same as that of an unseen element is $1 - (1 - 1/s)^k$. Given the results of Section 4.3, the asymptotic FNR of QHTD is $(1 - 1/s)^k$.

5.2 Queuing Buckets for a Better Sliding Window

One caveat of the QHT (and QHTD) is that at any insertion, every element of the row is equally likely to be evicted. While this allows an easy derivation of the FPR and FNR, it makes the filter’s behaviour somewhat counter-intuitive: one would expect a filter to forget the oldest elements before the newest ones. This is precisely the behaviour needed by a filter operating on a sliding window, which disregards the oldest elements.

The solution we provide for QHT is to order the buckets of a given cell in a FIFO queue: instead of selecting a random bucket in the cell for insertion, one appends the fingerprint to the end of the queue and pops the first element (so that the size of the queue remains constant). Combined with the QHTD improvement, this yields Algorithm 2.

for each element e of the stream do
    result ← UNSEEN
    Compute the quotient (row index) and the fingerprint of e
    for each bucket in the queue of that row do
        if the bucket holds the fingerprint of e then
            result ← DUPLICATE
            break
    Pop the first element of the queue and append the fingerprint of e at the end of the same queue
    return result
Algorithm 2 Queued Quotient Hash Table with Duplicates (QQHTD) Stream
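Algorithm 2 can be sketched in Python as follows. The names are hypothetical, the hashing is the same illustrative SHA-256 split as one might use for a QHT, and `collections.deque` with `maxlen` is used purely for readability; a compact implementation would shift a small fixed-size array instead, as discussed below.

```python
import hashlib
from collections import deque

EMPTY = 0  # assumption: the fingerprint value 0 is reserved for empty buckets


class QQHTD:
    """Illustrative Queued QHT with Duplicates: FIFO eviction, duplicates re-inserted."""

    def __init__(self, num_rows: int, buckets_per_row: int, fingerprint_bits: int) -> None:
        self.num_rows = num_rows
        self.num_fingerprints = 2 ** fingerprint_bits - 1
        self.rows = [deque([EMPTY] * buckets_per_row, maxlen=buckets_per_row)
                     for _ in range(num_rows)]

    def _row_and_fingerprint(self, element: bytes):
        digest = hashlib.sha256(element).digest()
        row = int.from_bytes(digest[:8], "big") % self.num_rows
        fingerprint = int.from_bytes(digest[8:16], "big") % self.num_fingerprints + 1
        return row, fingerprint

    def stream(self, element: bytes) -> str:
        row_index, fingerprint = self._row_and_fingerprint(element)
        queue = self.rows[row_index]
        result = "DUPLICATE" if fingerprint in queue else "UNSEEN"
        queue.append(fingerprint)  # maxlen drops the oldest bucket automatically
        return result


filt = QQHTD(num_rows=8, buckets_per_row=2, fingerprint_bits=8)
```

Because the fingerprint is appended even when the element is reported as a duplicate, a frequently repeated element keeps refreshing its position in the queue.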

Note that classical queues (i.e. doubly linked lists) are not suited to our use, as links require extra storage bits for pointers. Instead, we use an array holding the buckets of the row, in which we manually move every element at each “pop”. When the number of buckets is small (typically less than 5, which it usually is), the added overhead is not significant.

Finally, note that a QQHTD with one bucket per cell is equivalent to its QHT counterpart. However, with more buckets per cell, QQHTDs offer a noticeable improvement over QHTs on real data streams (see Appendix A).

6 Benchmarks

Stream (duplicate %)    Memory (bits)   SQF      QHT/QQHTD   Cuckoo   SBF      A2       b_DBF
Real (10.3 %)           8e+06           51.25    43.78       66.48    54.08    58.94    45.22
                        1e+06           55.18    48.38       68.96    57.17    62.12    45.30
                        100,000         58.21    52.22       71.83    59.44    64.88    59.39
                        10,000          66.45    58.75       78.86    76.29    66.42    99.77
Artificial (88.82 %)    8e+06           86.49    82.76       96.94    97.79    88.06    99.96
                        1e+06           98.27    97.80       99.60    99.74    98.49    99.96
                        100,000         99.78    99.79       99.96    99.98    99.84    99.96
                        10,000          99.97    99.97       100.02   99.99    100.00   99.98
Artificial (39.79 %)    8e+06           96.51    95.37       99.21    99.40    96.86    99.99
                        1e+06           99.56    99.42       99.91    99.91    99.60    99.99
                        100,000         99.94    99.95       100.00   99.99    99.96    99.98
                        10,000          100.00   100.00      100.00   100.00   99.99    100.00
Table 2: Error rate (multiplied by 100) on streams of 150,000,000 elements

6.1 Comparison of QHT to Other Filters

We used streams of 150,000,000 elements, on filters of size ranging from 10 kb to 8 Mb. In every case the filters are too small to keep track of the whole stream, and we will see that filters do reach saturation. We used 2 artificial streams, whose elements were randomly generated from two alphabets of different sizes, leading to duplicate rates of about 88.82% and 39.79% (see Table 2). We also used a real dataset of URLs visited by a crawling robot, extracted from the April 2018 CommonCrawl dump [17]. The source code is available on BitBucket (https://bitbucket.org/team_qht/qht/src/master).

The filters, parameterised so that their asymptotic FPR is as close as possible to the arbitrary value of 25%, are:

  • SQF, 1 bucket per row, and

  • QHT, 1 bucket per row, 3 bits per fingerprint. This specific QHT is equivalent to a QQHTD with the same parameters, so we do not include the latter in the benchmark.

  • Cuckoo Filter [9], cells containing 1 element of 3 bits each

  • Stable Bloom Filter (SBF) [7], 2 bits per cell, 2 hash functions, targeted FPR of 0.02 (our benchmarks actually obtained an asymptotic FPR of around 28%, without us being able to find bugs in our implementation)

  • A2 Filter [22], targeted FPR of on the sliding window.

  • Block-Decaying Bloom Filter (b_DBF) [19], sliding window of 6000 elements.

Results, averaged over 5 runs, are given in Table 2. Note that, for better readability, the error rates have been multiplied by 100 in the table. Furthermore, we recall that the error rate, being defined as FPR + FNR, is bounded by 0 below and 2 above, 1 being the error rate of a random filter. A filter can have worse results than random; for instance a filter which is always wrong has an error rate of 2.

As we can see in Table 2, QHT (or QQHTD) are extremely competitive and resist saturation very well; they also appear to be the most competitive on the real stream. b_DBF are efficient, but reach saturation very quickly. A more detailed analysis of their error rates (see Appendix A) shows that even though their FPR is close to 0, they get an FNR close to 1. Moreover, as we see in Section 6.2, they are significantly slower, which can be a bottleneck for critical applications. Furthermore, not only are QHT/QQHTD the most efficient filters on both real and artificial streams, they are also very easily tunable, and any asymptotic FPR is very simply achievable. This is not the case for other filters, such as A2 or b_DBF, which require careful tuning.

6.2 Speed Comparison

We also benchmarked the speed of every filter on real-time detection, on a laptop with an Intel i7. We used the same filters as in the previous subsection, with a memory of 1 Mb, on a stream of 150,000,000 elements. We averaged, on these filters, the time needed for Stream to execute for each element. Results are shown in Table 3. We observe that, save for SQF, QHT is 6 times as fast as any other filter, and 10 times as fast as b_DBF. Even SQF is 50% slower than QHT. Even with additional features, QQHTD are also faster than SQF by a large margin, because fingerprint derivation is more costly in the latter. In conclusion, we observe that QQHTD are the most suitable filters for high-speed analysis.

| Filter | SQF | QHT | QQHTD | Cuckoo | SBF | A2 | b_DBF |
|---|---|---|---|---|---|---|---|
| Time | 0.423 | 0.288 | 0.330 | 2.464 | 1.578 | 1.280 | 2.565 |

Table 3: Average amount of time (in s) required for one iteration of on each filter with 1 Mb of memory
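The relative speeds can be recomputed directly from the timings in Table 3. The short sketch below does only that arithmetic; the dictionary and the helper name `slowdown_versus` are illustrative and not part of the benchmark code.

```python
# Timings copied from Table 3 (same unit for every filter).
TABLE_3_TIMES = {
    "SQF": 0.423,
    "QHT": 0.288,
    "QQHTD": 0.330,
    "Cuckoo": 2.464,
    "SBF": 1.578,
    "A2": 1.280,
    "b_DBF": 2.565,
}


def slowdown_versus(baseline, times):
    """Return, for every filter, its time divided by the baseline filter's time."""
    reference = times[baseline]
    return {name: value / reference for name, value in times.items()}


if __name__ == "__main__":
    ratios = slowdown_versus("QHT", TABLE_3_TIMES)
    # Print the filters from fastest to slowest, relative to QHT.
    for name, ratio in sorted(ratios.items(), key=lambda item: item[1]):
        print(f"{name:>7}: {ratio:.2f}x the time of QHT")
```

With these figures, SQF takes about 1.47 times as long as QHT, A2 about 4.44 times, and b_DBF about 8.91 times.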

7 Adversarial Resistance

Despite the good performance of QQHTD on normal streams, one may not always assume that the stream is “normal”: there may be an attacker trying to fool the filter. For instance, if the filter must detect duplicates in order to prevent an attack (nonce requirements), then adversarial resistance is of paramount importance.

We can model an adversarial system in which an attacker knows the output of Stream (i.e., whether the element is classified as DUPLICATE or UNSEEN), but has no knowledge of the internal memory state. Thus the attacker is able to carry out an adaptive attack, by choosing the next element to send to the filter as a function of all previous insertions. In this adversarial game, the attacker can send an arbitrary stream to the filter, and is allowed to get the result of Stream for every element. Then, at her convenience, the attacker enters the second phase of the game, in which she has two possible actions:

  • Send an unseen element that will be a false positive with high probability (false positive attack);

  • Send a duplicate element that will be a false negative with high probability (false negative attack).

Theorem 7.1.

No filter can resist a false negative attack.

Proof.

We craft false negatives in steps ( being the filter’s memory size). Recall that no structure can remember more than one element per memory bit. For this reason, the structure can remember at most different elements. Consequently, if the attacker generates a stream of random unseen elements, then on average each element will stay for insertions in the filter’s memory. More generally, after insertions (for some rational ), an element is forgotten with probability at least . Thus an attacker simply generates unique elements before sending the first element again.

can even be estimated via saturation (see Section 4.3). Given that CPU time is cheaper than memory requirements, the attacker keeps her advantage over any filter of any size. ∎
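The attack used in this proof can be illustrated with a small simulation. The sketch below is only a toy: `ToyFingerprintFilter` is a hypothetical single-bucket fingerprint table standing in for a bounded-memory filter, not the QHT as specified in this paper, and the sizes (64 cells, a flood of 1,280 unique elements) are arbitrary assumptions. The attacker inserts a target, floods the filter with unique elements, then replays the target and checks whether it is reported as UNSEEN.

```python
import hashlib


class ToyFingerprintFilter:
    """Toy bounded-memory duplicate filter (hypothetical stand-in, not the paper's QHT)."""

    def __init__(self, n_cells, fp_bits=16):
        self.n_cells = n_cells
        self.fp_bits = fp_bits
        self.cells = [None] * n_cells  # one fingerprint slot per cell

    def _locate(self, element):
        digest = hashlib.sha256(str(element).encode()).digest()
        cell = int.from_bytes(digest[:8], "big") % self.n_cells
        fingerprint = int.from_bytes(digest[8:16], "big") % (1 << self.fp_bits)
        return cell, fingerprint

    def stream(self, element):
        """Classify the element, then remember it (overwriting the cell)."""
        cell, fingerprint = self._locate(element)
        if self.cells[cell] == fingerprint:
            return "DUPLICATE"
        self.cells[cell] = fingerprint
        return "UNSEEN"


def false_negative_attack(filter_, flood_size, target="target"):
    """Return True if the replayed target is wrongly reported as UNSEEN."""
    filter_.stream(target)                # first occurrence of the target
    for i in range(flood_size):           # flood with unique elements
        filter_.stream(("flood", i))
    return filter_.stream(target) == "UNSEEN"


if __name__ == "__main__":
    trials = 100
    successes = sum(
        false_negative_attack(ToyFingerprintFilter(64), 1280, target=f"t{i}")
        for i in range(trials)
    )
    print(f"false negatives forced in {successes}/{trials} trials")
```

The attacker needs no knowledge of the internal state here: the flood size only has to be large compared with the number of cells, which matches the intuition of the proof.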

Theorem 7.2.

Assuming the existence of one-way functions, QHT can resist a false positive attack.

Proof.

Following [18], we replace all hash functions by one-way hash functions, apply a (secret) one-way permutation to incoming elements, and then store the results in the filter as usual. Because of the permutation, the attacker gains nothing by choosing the elements adaptively, and thus loses her advantage. ∎
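The idea of hiding element placement behind a secret can be sketched as follows. This is a minimal illustration under explicit assumptions: HMAC-SHA256 is used as a keyed pseudo-random function in place of the secret one-way permutation of the proof (it is not a permutation), and `KeyedToyFilter` is a hypothetical single-bucket table, not the QHT itself.

```python
import hashlib
import hmac
import secrets


class KeyedToyFilter:
    """Toy filter whose cell and fingerprint depend on a secret key (illustrative only)."""

    def __init__(self, n_cells, key=None, fp_bits=16):
        self.n_cells = n_cells
        self.fp_bits = fp_bits
        # The key must stay unknown to the attacker; a random one is drawn by default.
        self.key = key if key is not None else secrets.token_bytes(32)
        self.cells = [None] * n_cells

    def _locate(self, element):
        # Assumption: HMAC-SHA256 stands in for the secret one-way transformation.
        tag = hmac.new(self.key, str(element).encode(), hashlib.sha256).digest()
        cell = int.from_bytes(tag[:8], "big") % self.n_cells
        fingerprint = int.from_bytes(tag[8:16], "big") % (1 << self.fp_bits)
        return cell, fingerprint

    def stream(self, element):
        cell, fingerprint = self._locate(element)
        if self.cells[cell] == fingerprint:
            return "DUPLICATE"
        self.cells[cell] = fingerprint
        return "UNSEEN"


if __name__ == "__main__":
    a = KeyedToyFilter(1024, key=b"first secret key")
    b = KeyedToyFilter(1024, key=b"second secret key")
    # The same element lands in unrelated places under different keys, so
    # collisions computed offline by an attacker do not carry over.
    print(a._locate("example"), b._locate("example"))
```

Without the key, an attacker cannot precompute which unseen elements share a cell and fingerprint with stored ones, which is what a false positive attack would require.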

In conclusion, QHTs are well suited to contexts where a low false positive rate is crucial, such as white-list email filtering. Note however that this holds for most filters, as long as they rely on hash functions (and so can apply [18]).

8 Conclusion

This paper introduces a new duplicate detection filter, QHT, and its variant QQHTD. QHTs achieve a better utilization of the available space, and as such are more efficient than existing filters. Moreover, QQHTD are more efficient at detecting duplicates in a real dataset.

We showed that, for an infinite stream with an infinite number of unseen elements, the number of rows is less important than the fingerprint space and the number of buckets per row. Moreover, we proved that no filter, once it has reached saturation, is more efficient than a random filter; as such, benchmarks of stream filters should only focus on the pre-saturation state, with small streams.

Even though QHTs are significantly more efficient than other structures in the literature, we do not know if these filters are optimal: are there other filters having an optimal resilience to saturation? Future work also includes examining the theoretical resistance to saturation of QQHTDs, and a finer examination of the QHT/QQHTD behaviour on a sliding window.

Appendix A Tables of error rates for various streams

Table 4 gives the error rates (FPR and FNR) used to derive Table 2.

| Stream (duplicate %) | Memory (bits) | SQF | QHT/QQHT | Cuckoo | SBF | A2 | b_DBF |
|---|---|---|---|---|---|---|---|
| Real (10.3 %) | 8e+06 | 24.61/26.64 | 14.00/29.78 | 28.23/38.26 | 25.10/28.98 | 37.72/21.21 | 0.00/45.22 |
| Real (10.3 %) | 1e+06 | 24.95/30.22 | 14.25/34.14 | 28.30/40.65 | 25.23/31.94 | 38.00/24.11 | 0.15/45.16 |
| Real (10.3 %) | 100,000 | 24.99/33.22 | 14.28/37.94 | 28.33/43.51 | 25.42/34.03 | 38.05/26.83 | 25.75/33.64 |
| Real (10.3 %) | 10,000 | 25.00/41.45 | 14.29/44.46 | 28.37/50.49 | 26.00/50.29 | 38.01/28.41 | 99.58/0.20 |
| Artificial (88.82 %) | 8e+06 | 21.87/64.62 | 12.02/70.74 | 27.80/69.14 | 25.94/71.85 | 35.22/52.84 | 0.00/99.96 |
| Artificial (88.82 %) | 1e+06 | 24.60/73.67 | 14.00/83.80 | 28.41/71.19 | 26.43/73.30 | 37.72/60.77 | 0.17/99.79 |
| Artificial (88.82 %) | 100,000 | 24.94/74.84 | 14.26/85.53 | 28.52/71.44 | 26.49/73.49 | 38.00/61.83 | 27.46/72.51 |
| Artificial (88.82 %) | 10,000 | 24.98/74.99 | 14.28/85.69 | 28.55/71.47 | 26.50/73.50 | 38.03/61.97 | 99.67/0.31 |
| Artificial (39.79 %) | 8e+06 | 24.42/72.09 | 13.86/81.52 | 28.39/70.82 | 26.40/73.00 | 37.54/59.32 | 0.00/99.99 |
| Artificial (39.79 %) | 1e+06 | 24.92/74.64 | 14.24/85.18 | 28.50/71.41 | 26.47/73.44 | 38.00/61.60 | 0.17/99.82 |
| Artificial (39.79 %) | 100,000 | 24.99/74.96 | 14.29/85.66 | 28.51/71.49 | 26.49/73.50 | 38.05/61.91 | 27.46/72.52 |
| Artificial (39.79 %) | 10,000 | 25.00/74.99 | 14.28/85.72 | 28.52/71.49 | 26.49/73.51 | 38.02/61.97 | 99.69/0.31 |

Table 4: Results of filters for various streams of 150,000,000 elements (FPR/FNR in %)

Appendix B Deriving

This derivation was removed from Section 4.2 for better readability. We know that , so

Expanding,

Given that , using Cesàro’s mean we get:

Which concludes the proof.

Appendix C Comparison of QHT and QQHTD

In this appendix, we explore the difference between a QHT and a QQHTD with the same parameters, on the same stream. We took filters of 65,536 bits each, on streams of 100,000 elements each. One stream is taken from our ‘real’ dataset (10.32% of duplicates), the other is a uniformly random stream on an alphabet of elements (4.62% of duplicates). Results are given in Table 5.

We observe that while QQHTD offer no advantage on artificial streams, their performance relative to the QHT is noticeably better on the real stream, which empirically validates the optimizations.

| Stream (duplicate %) | Filter | | | | |
|---|---|---|---|---|---|
| Real (10.3 %) | QHT | 27.01 | 28.07 | 28.95 | 29.3 |
| Real (10.3 %) | QQHTD | 24.67 | 24.56 | 24.42 | 24.62 |
| Artificial (4.62 %) | QHT | 67.39 | 74.48 | 79.32 | 82.10 |
| Artificial (4.62 %) | QQHTD | 67.91 | 74.41 | 79.19 | 82.26 |

Table 5: Error rate (times ) of QHTs and QQHTDs with the same parameters on different streams, depending on their parameters and
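The duplicate percentage of a uniformly random stream, such as the artificial stream of this appendix, follows from the stream length and the alphabet size. The sketch below computes the expected fraction of duplicates and checks it against a simulation. The alphabet size is not recoverable from the text above, so the value 2^20 used here is purely an assumption made for illustration, and the function names are hypothetical.

```python
import random


def expected_duplicate_fraction(n, alphabet_size):
    """Expected fraction of stream elements that repeat an earlier element."""
    # Expected number of distinct symbols after n uniform draws.
    expected_distinct = alphabet_size * (1 - (1 - 1 / alphabet_size) ** n)
    return 1 - expected_distinct / n


def simulated_duplicate_fraction(n, alphabet_size, seed=0):
    """Empirical duplicate fraction of one uniformly random stream."""
    rng = random.Random(seed)
    seen = set()
    duplicates = 0
    for _ in range(n):
        symbol = rng.randrange(alphabet_size)
        if symbol in seen:
            duplicates += 1
        else:
            seen.add(symbol)
    return duplicates / n


if __name__ == "__main__":
    n = 100_000               # stream length used in this appendix
    alphabet_size = 2 ** 20   # assumed value, not taken from the paper
    print(f"expected:  {expected_duplicate_fraction(n, alphabet_size):.4f}")
    print(f"simulated: {simulated_duplicate_fraction(n, alphabet_size):.4f}")
```

Under this assumed alphabet size, the expected fraction comes out at about 4.6%, which is of the same order as the figure quoted for the artificial stream.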

References

  • [1] Babcock, B., Datar, M., and Motwani, R. Load shedding for aggregation queries over data streams. In Proceedings. 20th International Conference on Data Engineering (March 2004), IEEE Computer Society, pp. 350–361.
  • [2] Bloom, B. H. Space/time trade-offs in hash coding with allowable errors. Commun. ACM 13, 7 (July 1970), 422–426.
  • [3] Borg, M., Runeson, P., Johansson, J., and Mäntylä, M. V. A replicated study on duplicate detection: Using apache lucene to search among android defects. In Proceedings of the 8th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (New York, NY, USA, 2014), ESEM ’14, ACM, pp. 8:1–8:4.
  • [4] Chen, H., Liao, L., Jin, H., and Wu, J. The dynamic cuckoo filter. In 2017 IEEE 25th International Conference on Network Protocols (ICNP) (Oct 2017), pp. 1–10.
  • [5] Cohen, S., and Matias, Y. Spectral bloom filters. In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data (New York, NY, USA, 2003), SIGMOD ’03, ACM, pp. 241–252.
  • [6] Cormode, G., and Muthukrishnan, S. An improved data stream summary: the count-min sketch and its applications. Journal of Algorithms 55, 1 (2005), 58 – 75.
  • [7] Deng, F., and Rafiei, D. Approximately detecting duplicates for streaming data using stable Bloom filters. In SIGMOD Conference (2006), ACM, pp. 25–36.
  • [8] Dutta, S., Narang, A., and Bera, S. K. Streaming quotient filter: A near optimal approximate duplicate detection approach for data streams. Proc. VLDB Endow. 6, 8 (June 2013), 589–600.
  • [9] Fan, B., Andersen, D. G., Kaminsky, M., and Mitzenmacher, M. D. Cuckoo filter: Practically better than bloom. In Proceedings of the 10th ACM International on Conference on Emerging Networking Experiments and Technologies (New York, NY, USA, 2014), CoNEXT ’14, ACM, pp. 75–88.
  • [10] Fan, L., Cao, P., Almeida, J., and Broder, A. Z. Summary cache: A scalable wide-area web cache sharing protocol. IEEE/ACM Trans. Netw. 8, 3 (June 2000), 281–293.
  • [11] Fu, M., Feng, D., Hua, Y., He, X., Chen, Z., Xia, W., Zhang, Y., and Tan, Y. Design tradeoffs for data deduplication performance in backup workloads. In Proceedings of the 13th USENIX Conference on File and Storage Technologies (Berkeley, CA, USA, 2015), FAST’15, USENIX Association, pp. 331–344.
  • [12] Guo, D., Wu, J., Chen, H., Yuan, Y., and Luo, X. The dynamic bloom filters. IEEE Transactions on Knowledge and Data Engineering 22, 1 (Jan 2010), 120–133.
  • [13] Jowhari, H., Saglam, M., and Tardos, G. Tight bounds for lp samplers, finding duplicates in streams, and related problems. CoRR abs/1012.4889 (2010), 49–58.
  • [14] Kapralov, M., Nelson, J., Pachocki, J., Wang, Z., Woodruff, D. P., and Yahyazadeh, M. Optimal lower bounds for universal relation, and for samplers and finding duplicates in streams. In FOCS (2017), IEEE Computer Society, pp. 475–486.
  • [15] Køien, G. A brief survey of nonces and nonce usage. In Securware 2015 - The Ninth International Conference on Emerging Security Information, Systems and Technologies (2015), SECURWARE ’15, IARIA XPS Press, pp. 85–91.
  • [16] Metwally, A., Agrawal, D., and El Abbadi, A. Duplicate detection in click streams. In Proceedings of the 14th International Conference on World Wide Web (New York, NY, USA, 2005), WWW ’05, ACM, pp. 12–21.
  • [17] Nagel, S. April 2018 crawl archive now available, 2018. http://commoncrawl.org/2018/05/april-2018-crawl-archive-now-available/.
  • [18] Naor, M., and Yogev, E. Bloom filters in adversarial environments. In Advances in Cryptology – CRYPTO 2015 (Berlin, Heidelberg, 2015), R. Gennaro and M. Robshaw, Eds., Springer Berlin Heidelberg, pp. 565–584.
  • [19] Shen, H., and Zhang, Y. Improved approximate detection of duplicates for data streams over sliding windows. Journal of Computer Science and Technology 23, 6 (2008), 973–987.
  • [20] Tarkoma, S., Rothenberg, C. E., and Lagerspetz, E. Theory and practice of bloom filters for distributed systems. IEEE Communications Surveys Tutorials 14, 1 (First 2012), 131–155.
  • [21] Wolper, P., and Leroy, D. Reliable hashing without collision detection. In Computer Aided Verification (Berlin, Heidelberg, 1993), C. Courcoubetis, Ed., Springer Berlin Heidelberg, pp. 59–70.
  • [22] Yoon, M. Aging bloom filter with two active buffers for dynamic sets. IEEE Trans. on Knowl. and Data Eng. 22, 1 (Jan. 2010), 134–138.