Approximate Membership Query Filters with a False Positive Free Set

In the last decade, significant efforts have been made to reduce the false positive rate of approximate membership checking structures. This has led to the development of new structures such as cuckoo filters and xor filters. Adaptive filters that can react to false positives as they occur to avoid them for future queries to the same elements have also been recently developed. In this paper, we propose a new type of static filters that completely avoid false positives for a given set of negative elements and show how they can be efficiently implemented using xor probing filters. Several constructions of these filters with a false positive free set are proposed that minimize the memory and speed overheads introduced by avoiding false positives. The proposed filters have been extensively evaluated to validate their functionality and show that in many cases both the memory and speed overheads are negligible. We also discuss several use cases to illustrate the potential benefits of the proposed filters in practical applications.


1 Introduction

Approximate membership checking is widely used in many computing applications [32], for example to accelerate access to key-value stores [9] or to data stored in external memory [29]. It is also used in networking, for example for packet classification [30]. The most frequently used data structure for approximate membership queries is probably the Bloom filter [4], for which many enhancements and optimizations have been proposed [23]. A Bloom filter maps an element to a set of bits in a bit vector using a group of hash functions; it sets those bits to one on insertion and checks whether they are all one when checking membership. This construction ensures that there are no false negatives. False positives, however, can occur when other elements have set to one all the positions checked by a given element. Therefore, the approximation manifests as a false positive probability for elements not originally added to the filter [4]. This is the price paid for having a much smaller and faster representation of the set. Alternative approximate membership query structures have also been proposed, such as the cuckoo filter [14] or, more recently, the xor filter [18], which can achieve a smaller false positive probability with the same memory budget in some settings.
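The insert and check operations of a Bloom filter described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the class name, the parameters `m` and `num_hashes`, and the use of salted BLAKE2b as the hash family are hypothetical choices, and one byte is spent per bit for readability.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter sketch: m bits probed by num_hashes hash functions."""

    def __init__(self, m: int, num_hashes: int):
        self.m = m
        self.num_hashes = num_hashes
        self.bits = bytearray(m)  # one byte per bit, for clarity rather than compactness

    def _positions(self, item: str):
        # Derive the hash functions by salting a single hash with the function index.
        for i in range(self.num_hashes):
            digest = hashlib.blake2b(item.encode(), salt=i.to_bytes(8, "little")).digest()
            yield int.from_bytes(digest[:8], "little") % self.m

    def add(self, item: str) -> None:
        for p in self._positions(item):
            self.bits[p] = 1

    def __contains__(self, item: str) -> bool:
        # Positive only if every probed bit is one, so inserted items never miss.
        return all(self.bits[p] for p in self._positions(item))
```

An element that was never added is reported as present only when all of its probed bits happen to have been set by other elements, which is the false positive case discussed above.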

Since false positives are the main drawback of all these approximate membership query structures, significant efforts have been made to reduce them, for example by adapting the filter to react to false positives as they occur [3],[28]. Another approach is to design filters that are false positive free when the stored set is small and its elements are taken from a small universe [20]. Unfortunately, this limits the use of such false positive free filters to very small sets having only a few elements. In this paper we introduce a new type of filter that is false positive free on a subset of the negative elements. That is, the proposed filters do suffer false positives over the entire universe, but we can select a set of negatives, which we assume can be larger than the set of elements stored in the filter, that will not suffer false positives. This is useful to reduce the observed false positive rate by including the most commonly accessed negative elements in that set.

To better understand the problem we address in this paper, let us consider one of the initial applications of a Bloom filter: storing the set of correct words to check spelling [26]. In more detail, all the words of the dictionary are stored in the Bloom filter, which is then used to check written words so that an error is flagged when the Bloom filter returns a negative. This design ensures that valid words are always classified as correct, but some invalid words may go undetected. An interesting observation is that some common misspellings occur frequently, and it would be beneficial not to have false positives for them on the filter. However, the Bloom filter treats all words that are not in the dictionary equally, so they all have the same false positive probability.

In this paper we consider the following problem: given two disjoint sets $S$ and $N$ of elements that belong to a universe $U$, design a false negative free filter that stores $S$, has no false positives for elements in $N$, and has a low false positive probability for the rest of the elements, which are neither in $S$ nor in $N$. That is, our goal is to design a filter for which elements in $S$ are guaranteed to query positive, elements in $N$ are guaranteed to query negative, and most other elements query negative. We say that these filters are false positive free on a set ($N$) that is possibly larger than $S$. Such a filter could be used with $S$ holding the words in the dictionary and $N$ holding the common misspellings to make the filter-based spell checker more effective. Filters with a false positive free set would also be beneficial in applications such as filtering for database queries [9], in-memory key-value stores [29], filtering of URLs [1], or lightweight Bitcoin nodes [16], among many others, as discussed in several case studies in Section 5.

The rest of the paper is organized as follows. Section 2 covers the preliminaries by briefly reviewing existing approximate membership query filters and the efforts to reduce their false positive rate, and by providing an overview of the xor probing filters that are used to build our proposed filters. The proposed filters are presented and analyzed in Section 3. Section 4 presents the evaluation of the filters and compares them with existing filters. Section 5 discusses several use cases where the proposed filters would be beneficial. Finally, Section 6 ends the paper with the conclusion and some ideas for future work.

2 Preliminaries

2.1 Approximate membership query filters

As discussed in the introduction, approximate membership checking is widely used in computing and networking, and different structures have been proposed to implement it. The most well-known and commonly used option is the Bloom filter [23], but alternatives like the Bloomier filter [6], cuckoo filters [14] or xor filters [18] have been proposed in the last two decades. In all cases, elements are mapped to several positions in an array or table using a set of hash functions. The contents of those positions are used to check membership. The filters can be classified according to how the checking is done [12]:

  • And probing: the logical And of the positions is used to check membership. This is the case of the Bloom filter in which an array of bits is used and true is returned when the and of positions is a one and a false is returned otherwise.

  • Or probing: the logical Or of the matching of the contents of the positions with a value derived from the element is used to detect positives. This is the scheme used in the cuckoo filter that computes a fingerprint per element and compares it with the fingerprints stored in the positions and if any of them matches the element’s fingerprint true is returned [30].

  • Xor probing: the logical Xor of the positions is used to compute a value that is used to detect positives. A value derived from the element is compared with the result; true is returned on a match and false otherwise. This is the scheme used in the xor and ribbon filters [18],[12].

Each filter type and construction has advantages and disadvantages. For example, Bloom filters are the simplest to construct and their false positive probability (FPP) degrades gracefully if the number of elements added does not match what was predicted and optimized for. However, they do not support deletions unless counters instead of bits are used and their FPP is large in some settings [23]. Another advantage of Bloom filters is that they can be implemented on a single memory access [25], so can be made faster than cuckoo filters in software [22]. While cuckoo filters support removals and have lower FPP in many settings, space efficiency with low FPP depends on high occupancy but not so high to fail insertion [30]. Finally, xor filters [18] and ribbon filters [12] achieve better FPP for the same memory budget in many cases, with good speed, but do not support dynamic insertions and removals, limiting their use to static filtering applications, in which real time updates to the set are not needed. In summary, the best filter configuration depends on the application requirements and implementation platform.

2.2 Reducing the false positive probability of filters

First, when the universe of elements is sufficiently small, it is possible to represent an exact set with no false positives, using less space than a generic filter with low but non-zero FPP, e.g. [7]. It is even possible for exact sets to degrade into filters with non-zero FPP as more elements are added [20, 31, 11]. We expect only a small minority of applications to have universes sufficiently small to justify these approaches. The rest of the paper assumes a large or infinite universe.

One of the main performance parameters of approximate query filters is the FPP that they achieve for a given memory budget, and many efforts have been made to reduce it. For example, optimizations of Bloom filters such as the Variable Increments Bloom filter have been proposed to reduce the FPP [4]. Indeed, one of the main motivations for developing cuckoo, xor and ribbon filters was to achieve a lower FPP for the same space.

Another strategy is to selectively reduce the FPP only for elements that are frequently accessed. In a Bloom filter, this can be done by setting to zero some positions in the array that correspond to frequently accessed false positive elements. This, however, introduces false negatives for other elements [13]. The frequently accessed elements may not be known in advance, and in that scenario filters have to remove false positives adaptively as they occur [3]. For example, Adaptive Cuckoo Filters (ACFs) change the stored fingerprints to remove false positives as they occur [28].

As discussed in the introduction, in this paper we focus on the design of filters that are false positive free on a set $N$ of elements while allowing false positives for the rest of the negative elements in the universe. To the best of our knowledge, efficient designs for such filters when $N$ is large have not been proposed before. False positives can be avoided by using a Bloomier filter [6], but each such false positive requires an entry in the filter, thus incurring a large cost when $N$ is large. There have also been efforts to reduce the false positives for a given subset of frequently accessed elements, for example [10],[28], but not to completely avoid them by construction.

2.3 Xor probing filters

An xor probing filter can be seen at a high level as a function that for a given element returns a value that is obtained by performing the xor of several positions on the filter given by hash functions . The filter is constructed so that holds when an element has been inserted into the filter, where is a fingerprint computed on the element. Instead when the element has not been inserted on the filter, in most cases and the probability of having would be approximately , where is the number of bits in .

A lookup on an xor probing filter is a simple operation that just requires accessing positions , computing the xor of their values to obtain and comparing it with . On a match, true is returned and false is returned otherwise. The construction of xor probing filters is more complex and needs to be precomputed, so these filters are not suitable for applications that need to perform frequent updates on the stored elements. Two constructions have been proposed so far, the xor filter [24, 18] and the ribbon filter [12]. The first one usually probes random positions while the second probes many more positions in a small window. A standard xor filter uses bits of memory per element, already a significant improvement on Bloom filter’s best case of . With some elaboration [15, 33], both constructions approach information-theoretic limits, requiring as little as bits with negligible additive overheads. This makes them the most memory efficient approximate membership query structures to date for many settings. In the following, we assume that in general an bit xor probing filter requires memory bits per element.
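To make the lookup and the precomputed construction more concrete, the following Python sketch builds a small xor-probing table by peeling, in the spirit of the xor filter [18], and then queries it. It is a simplified illustration rather than the authors' implementation: all function names are hypothetical, the table is a plain Python list with three probe positions per key, and the sizing of roughly 1.23 slots per key plus a small constant is an assumption of this sketch.

```python
import hashlib


def hash_to(key: str, seed: int, tag: int, mod: int) -> int:
    """Hash `key` into [0, mod); `seed` and `tag` select the hash function."""
    salt = seed.to_bytes(8, "little") + tag.to_bytes(8, "little")
    digest = hashlib.blake2b(key.encode(), salt=salt).digest()
    return int.from_bytes(digest[:8], "little") % mod


def positions(key: str, seed: int, seg: int) -> list:
    """Three probe positions, one in each of three disjoint segments."""
    return [i * seg + hash_to(key, seed, i, seg) for i in range(3)]


def fingerprint(key: str, f: int) -> int:
    """f-bit fingerprint of the key (tag 3 keeps it apart from the position hashes)."""
    return hash_to(key, 0, 3, 1 << f)


def build(values: dict, max_tries: int = 100):
    """Return (seed, seg, table) such that the xor of a key's three slots equals values[key]."""
    keys = list(values)
    seg = (int(1.23 * len(keys)) + 32) // 3 + 1
    for seed in range(max_tries):
        count = [0] * (3 * seg)    # number of unpeeled keys touching each slot
        key_xor = [0] * (3 * seg)  # xor of the indices of those keys
        for idx, key in enumerate(keys):
            for p in positions(key, seed, seg):
                count[p] += 1
                key_xor[p] ^= idx
        order = []
        ready = [p for p in range(3 * seg) if count[p] == 1]
        while ready:
            p = ready.pop()
            if count[p] != 1:
                continue  # the key that made this slot ready was peeled elsewhere
            idx = key_xor[p]
            order.append((idx, p))
            for q in positions(keys[idx], seed, seg):
                count[q] -= 1
                key_xor[q] ^= idx
                if count[q] == 1:
                    ready.append(q)
        if len(order) == len(keys):
            table = [0] * (3 * seg)
            # Assign in reverse peeling order so earlier assignments are never disturbed.
            for idx, p in reversed(order):
                value = values[keys[idx]]
                for q in positions(keys[idx], seed, seg):
                    if q != p:
                        value ^= table[q]
                table[p] = value
            return seed, seg, table
    raise RuntimeError("peeling failed for every seed tried")


def xor_probe(key: str, seed: int, seg: int, table: list) -> int:
    a, b, c = positions(key, seed, seg)
    return table[a] ^ table[b] ^ table[c]


def contains(key: str, f: int, seed: int, seg: int, table: list) -> bool:
    """Lookup: xor the probed slots and compare with the key's fingerprint."""
    return xor_probe(key, seed, seg, table) == fingerprint(key, f)
```

Passing `{key: fingerprint(key, f)}` to `build` yields a filter; passing arbitrary values yields a static function, which is the property the constructions in the next section rely on.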

In the rest of the paper we will use xor probing filters to construct our filters with false positive free sets in a general form. The specifics of xor filter constructions will be covered in the evaluation section when analyzing the practical performance of our filters. Before presenting the proposed filters in the next section, the main notations used in the rest of the paper are summarized in Table 1.

Symbol          Meaning
$U$             universe of elements
$S$             set stored in the filter
$N$             set for which false positives must be avoided
$N_r$           residual set for which false positives must be avoided
$f$             number of bits of the base xor filter
$k$             number of subfilters of the second filter in the Integrated Filter (IF) design
$b$             number of bits added to the first filter
$\varepsilon$   false positive probability
$C$             memory overhead required by the xor probing filter
Table 1: Summary of main notations

3 Avoiding False Positives for a given set

As discussed in the introduction, our goal is to design filters that provide approximate membership checking for a set ensuring that all elements of set return a negative on the filter. In the following subsections we first formally define the target filters, analyze their lower bound memory requirements theoretically and discuss briefly several potential options to construct the filters. Then, the proposed constructions are presented starting from a naive implementation that is used to illustrate the main ideas and the different constructions are analyzed theoretically. Finally, some potential optimizations for the proposed constructions are briefly discussed.

3.1 Filters with a False Positive Free Set

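For a universe $U$ and a number $\varepsilon \in [0, 1]$, a filter with a false positive free set is a randomised data structure constructed from two disjoint sets $S, N \subseteq U$. When queried for $x \in U$, it returns a bit such that

  • (i) the bit is 1 for all $x \in S$,

  • (ii) the bit is 0 for all $x \in N$,

  • (iii) $\Pr[\text{the bit is } 1] \le \varepsilon$ for all $x \in U \setminus (S \cup N)$, where “$\Pr$” refers to randomness used in the construction algorithm.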

Note how this is a generalisation of several important problems. For $N = \emptyset$ we obtain an ordinary filter. For $\varepsilon = 0$ (or $N = U \setminus S$) we obtain a set data structure. If we omit condition (iii) or set $\varepsilon = 1$, then the problem has been referred to as the relative membership problem [2] and calls for a static function with 1-bit values. (Unless $|S| \approx |N|$, a compressed static function should be used for space efficiency [2, 15, 19].) We do not claim any improvements for these widely studied special cases. From now on let , and . Our main interest concerns cases where $|S|$ and $|N|$ are both large, and $\varepsilon$ is bounded away from 0 and 1.

3.2 Space Lower Bound

Using arguments resembling those in [27, Chapter 2.3] for perfect hashing and in [5] for Bloom filters, we now derive a space lower bound for filters with a false positive free set. Such a data structure implies a related data structure with essentially the same space consumption that satisfies (i), (ii) and the more combinatorial (and slightly weaker) property

(iii’)

for some . The number of possible inputs for such a data structure is

Different inputs are not necessarily handled differently. Indeed, after construction, the data structure returns for a set of elements of and for . As such, a single memory state is suitable for all inputs where and , i.e. for

inputs. By the pigeon hole principle, at least memory states are possible after construction and hence bits of memory are required. To obtain a simple bound, we assume that is large, more precisely . In that case

When is small a good approximation is obtained using , namely

A lower bound for a filter with a false positive free set is now obtained by choosing that minimises space consumption, i.e.

(1)

Note that this suggests two regimes: When then using is optimal and the term depending on is negligible. We recover essentially the lower bound for ordinary filters. When is larger, however, the minimum is attained for a value . In particular the most compact filter with false positive free set should be expected to have a false positive rate smaller than required. We will reencounter this phenomenon in our implementation which will, up to a constant factor, achieve the lower bound (1).

3.3 High-level Design Strategies

Before developing our preferred solutions, let us discuss a few high-level strategies that come to mind.

  • One might construct a filter for and—as an afterthought—retouch it [13] such that no element from is a false positive, but without introducing false negatives. Applied to Bloom filters and xor filters, this plan seems hopeless. It is plausible, however, that it can work for adaptive filters. For example, in adaptive cuckoo or quotient filters [28],[3] by adapting the filter to the set before using the filter. The same applies to more recent variants of adaptive cuckoo filters [21] where stored fingerprints that cause a false positive can be moved to alternative positions where they do not. However, removing one false positive has a chance of creating false positives for different elements in , making it unclear how to guarantee the simultaneous removal of all false positives for efficiently. Even if this alternative works, our favoured approach is simpler and more space efficient.

  • One could construct a cascade of filters as suggested for a similar setting in [6]. First, construct a filter for . Then, construct a filter for those elements from that are false positives of the first filter. Then, construct a filter for those elements from that are false positives of the second filter (making them “false false positives”) and so on resulting in a sequence of ever smaller filters, each correcting the mistakes of the previous one. This would likely work, but our favoured approach is simpler and has better worst-case query time.

  • One could use a -bit static function with support and value for and for . Unfortunately, most constructions return a uniformly random bit for , giving us no easy way to control the false positive rate. An exception might be the construction from [2] where the output distribution for inputs coincides with the output distribution for random elements from the support. This yields at first, but could plausibly be tuned for any false positive rate. We have not pursued this approach because it is rather complicated with no practical implementation that we know of.

  • Bloomier filters [6] can solve the problem at hand. Even though we improve upon the idea later in the paper, it is instructive to consider it in some detail in Subsection 3.4.

Our favoured approach involves aspects of some of these ideas, combining a filter with a static function. Since xor-probing is suitable to implement both of these sub-tasks, it is the natural probing strategy for us to use.

3.4 Naive construction

As discussed in the previous section, an xor probing filter ensures that for each element inserted, the filter computes a value that is equal to . Indeed a positive is returned when and a negative otherwise. Therefore, to ensure that an element returns a negative, we can insert it on the filter with a value . Then when searching for the filter will return that is different from so a negative is obtained. This is illustrated in Figure 1. This naive construction can also be implemented with a Bloomier filter [6] using two categories—one for positives and one for negatives—and storing and on each of them respectively.

Figure 1: Diagram of the naive construction of a filter with a false positive free set .

This simple scheme can be used to insert all the elements in with values different from their values and thus avoid false positives for . However, this comes at a large cost in terms of memory as now the filter stores elements and thus requires approximately:

(2)

memory bits for a filter with a false positive rate of . This means that ensuring that each element on is not a false positive requires bits, the same memory as for the elements in . On the other hand, the check operation on the filter remains the same and thus removing the false positives has no implications on speed.
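As a worked illustration of this cost, the snippet below evaluates the memory of the naive construction under the assumption that every stored element costs the overhead factor times the fingerprint width. The function name and the parameter names `s`, `n`, `f` and `overhead` are hypothetical labels for the sizes of the stored set and of the false positive free set, the fingerprint bits, and the xor probing overhead; the default of 1.23 is the overhead of a standard xor filter.

```python
def naive_bits(s: int, n: int, f: int, overhead: float = 1.23) -> float:
    """Memory bits of the naive construction: all s + n elements get an f-bit entry."""
    return overhead * f * (s + n)


# One million stored elements plus one million protected negatives with 8-bit values:
# the protected negatives double the memory of the plain filter.
plain = naive_bits(1_000_000, 0, 8)
with_negatives = naive_bits(1_000_000, 1_000_000, 8)
```

Under these assumptions each protected negative costs exactly as much as a stored element, which is the inefficiency the next constructions address.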

Finally, let us consider the False Positive Probability (FPP) for elements of the universe that are not in $S \cup N$. For those, the value returned by the filter would look like a uniformly distributed random value, and thus the probability that it matches the value of the element, so that a false positive is obtained, will be approximately $2^{-f}$, where $f$ is the number of bits of the value.

3.5 Two filters (TF) construction

The memory footprint can be significantly reduced by observing that to ensure that an element is not a false positive, using a one bit filter suffices. Therefore, an bit filter can be used to store the elements in and a second one bit filter to store the elements in (the elements in with their negated bit value). In fact, this second one-bit filter can be seen as a static function that maps elements in to 1 and elements in to 0. Based on a subtle observation, not all elements in have to be inserted on the second filter. Indeed only the elements in that return a false positive on the first filter need to be inserted on the second filter. Let us denote as the set formed by the elements in that return a false positive in the first filter, and let . Then only the elements in have to be inserted on the second filter and , so is typically much smaller than . Therefore, this scheme would require:

(3)

memory bits and can be expressed as a function of as:

(4)

which is much lower than that of the naive construction (using assumption from assumption ).

Looking carefully at the second term of equation 3, it can be observed that each element that needs to be removed requires bits. This is not efficient when . To explain why, for example, let us consider that so that , then we would need memory bits. Instead, let us consider an alternative configuration in which the first filter has bits instead of bits. Then, we would have and only memory bits are needed. This shows that when , it is more efficient to add bits to the first filter using such that . This is so because by adding an additional bit to the first filter we reduce in half the size of and thus when is large this is more efficient than storing all the elements in in the second filter. Therefore, in the TF construction when that occurs, bits are added to the first filter to reduce the size of . This generalized TF construction is illustrated in Figure 2.

Figure 2: Diagram of the Two Filter (TF) construction with a false positive free set .

The memory requirements are now:

(5)

that can be expressed as a function of as:

(6)

so much smaller than in the naive construction. The FPP would be approximately so the use of additional bits on the first filter has the side benefit of reducing the FPP.
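The trade-off behind the added bits can be explored numerically. The sketch below assumes a cost model read off the description above: the first filter stores the stored set with f + b bits per element, the second one-bit filter stores the stored set plus the residual false positives, and the residual set has about n / 2^(f+b) elements. The function names, the parameter names and the shared overhead factor of 1.23 are assumptions made for this illustration.

```python
def tf_bits(s: int, n: int, f: int, b: int = 0, overhead: float = 1.23) -> float:
    """Approximate memory bits of the two filters construction with b extra bits."""
    residual = n * 2.0 ** -(f + b)        # expected false positives of the first filter
    first = overhead * (f + b) * s        # (f + b)-bit filter over the stored set
    second = overhead * (s + residual)    # one-bit filter over stored set and residual
    return first + second


def best_b(s: int, n: int, f: int, overhead: float = 1.23, max_b: int = 32) -> int:
    """Number of extra bits that minimises tf_bits, found by exhaustive search."""
    return min(range(max_b + 1), key=lambda b: tf_bits(s, n, f, b, overhead))
```

For example, with one million stored elements, f = 8 and a residual set three times the size of the stored set at b = 0, the search selects b = 1: the extra bit costs one bit per stored element but halves the residual set.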

A drawback of the TF construction is the speed overhead of checking two filters for some queries, specifically positive and a fraction of negative queries. This would be acceptable in many cases as negative queries motivate the choice to use a filter.
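A compact, runnable sketch of the two filters idea is given below. It is not the authors' Java implementation: the xor-probing tables are built with a simple peeling procedure, all names are hypothetical, and the added bits are omitted for brevity. The first table stores an f-bit fingerprint per stored element; the second stores one bit, 1 for stored elements and 0 for the negatives that pass the first table.

```python
import hashlib


def hash_to(key, seed, tag, mod):
    salt = seed.to_bytes(8, "little") + tag.to_bytes(8, "little")
    digest = hashlib.blake2b(key.encode(), salt=salt).digest()
    return int.from_bytes(digest[:8], "little") % mod


def positions(key, seed, seg):
    return [i * seg + hash_to(key, seed, i, seg) for i in range(3)]


def build(values, max_tries=100):
    """Peeling construction: xor of a key's three slots equals values[key]."""
    keys = list(values)
    seg = (int(1.23 * len(keys)) + 32) // 3 + 1
    for seed in range(max_tries):
        count, key_xor = [0] * (3 * seg), [0] * (3 * seg)
        for idx, key in enumerate(keys):
            for p in positions(key, seed, seg):
                count[p] += 1
                key_xor[p] ^= idx
        order, ready = [], [p for p in range(3 * seg) if count[p] == 1]
        while ready:
            p = ready.pop()
            if count[p] != 1:
                continue
            idx = key_xor[p]
            order.append((idx, p))
            for q in positions(keys[idx], seed, seg):
                count[q] -= 1
                key_xor[q] ^= idx
                if count[q] == 1:
                    ready.append(q)
        if len(order) == len(keys):
            table = [0] * (3 * seg)
            for idx, p in reversed(order):
                value = values[keys[idx]]
                for q in positions(keys[idx], seed, seg):
                    if q != p:
                        value ^= table[q]
                table[p] = value
            return seed, seg, table
    raise RuntimeError("peeling failed for every seed tried")


def probe(key, built):
    seed, seg, table = built
    a, b, c = positions(key, seed, seg)
    return table[a] ^ table[b] ^ table[c]


def build_tf(stored, negatives, f):
    """Two filters: f-bit fingerprints for `stored`, one bit to veto residual negatives."""
    fp = lambda x: hash_to(x, 0, 3, 1 << f)
    first = build({x: fp(x) for x in stored})
    residual = [x for x in negatives if probe(x, first) == fp(x)]
    second = build({**{x: 1 for x in stored}, **{x: 0 for x in residual}})
    return {"first": first, "second": second, "f": f, "residual": len(residual)}


def query(x, tf):
    if probe(x, tf["first"]) != hash_to(x, 0, 3, 1 << tf["f"]):
        return False                      # most negatives stop at the first filter
    return probe(x, tf["second"]) == 1    # second filter vetoes the residual negatives
```

Only queries that pass the first filter touch the second one, which matches the observation that positives and a fraction of the negatives pay for two lookups.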

3.6 Integrated filters (IF) construction

The two filters construction is effective to reduce the memory footprint (space) but introduces other overheads (time) that may be an issue in some applications. Therefore, it would be beneficial to have a construction that is as fast as a single filter and is more memory efficient than the naive construction requiring $C(|S| + |N|)$ positions. By increasing the number of positions in filter 1 to match filter 2, from $C|S|$ to $C(|S| + |N_r|)$ positions, both filters can be fully integrated in the same data structure. And by using the same position hash functions for both (this leads to potential correlation in construction success probability, not in false positive probability), the bits for both filters can be concatenated in a single larger entry per position, so that querying both involves no more memory lookups than a single filter query. This scheme is illustrated in Figure 3 and would require a memory footprint of:

(7)

In this case, adding bits to the first filter is more beneficial as the cost of the elements in is larger. The analytical derivation of the optimal value for becomes more complex as it now appears twice in the equation, but it can be easily found by computing equation 7 starting with and increasing until the memory required increases. Alternatively, when is much larger than , we can approximate with and then the condition to minimize the memory becomes and thus:

(8)

This results in a memory overhead that is larger than that of the TF approach, especially when the initial is large.

Figure 3: Diagram of the Integrated Filter (IF) construction of a filter with a false positive free set when . Filters 1 and 2 use the same hash functions to compute the addresses to access and have the same size.

To reduce the overhead, the previous construction can be generalized as shown in Figure 4. It can be seen that now bits are used for the second filter that is decomposed in subfilters each storing a subset of the elements in . As in the initial construction, the bits are concatenated per position, each having bits for the first filter and for the second. In more detail, another hash function is used to compute that selects the subfilter in the second filter. Therefore each subfilter stores approximately elements. When this is smaller than , the memory footprint is:

(9)

The minimum value of that can be used to ensure that given a set can be computed as:

(10)

So the minimum memory needed is given by:

(11)

Figure 4: Diagram of the Integrated Filter (IF) construction of a filter with a false positive free set when . Filter 2 is composed of two subfilters; each element maps to one of them given by .

As in the previous constructions, a subtle observation is that when the value obtained for $k$ is larger than two, it is better to increase the value of $f$ and use $k = 2$. This is better seen with an example. Let us assume that $|N_r|/|S|$ is equal to 3.6. Then $k$ would take a value of five, so five bits have to be added to the filter. If instead we use $b = 2$, then $|N_r|/|S|$ would be 0.9 and $k$ would take a value of two, needing four bits in total, fewer than the original five. In general, by adding $b$ bits to $f$ to obtain $k = 2$, the following number of bits would be needed:

(12)

when this value is lower than that of Equation 10, adding bits to is more efficient.

Therefore, the procedure to select the parameters for the IF design is as follows:

  1. Determine according to equation 10.

  2. If no additional bits are needed for , the filter is implemented with the original value of .

  3. If add bits to and recompute equation 10 to obtain the final value of .

  4. Implement the IF with bits on the first filter and subfilters in the second filter.

So, it can be seen that in all cases is used. When , for example when is equal to 1.8, both alternatives require the same number of bits (four), but adding the bit to has the side benefit of reducing the false positive rate for elements that are not in so it should be used.

In summary, in the integrated filter construction to reduce the memory footprint in a given design, the value of is always fixed to two and is computed. In terms of lookups, this construction does not introduce additional memory accesses, only some operations to select and check the subfilter when needed.

The FPP would also be approximately so similar to previous constructions when and lower otherwise.
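The parameter selection procedure listed above can be sketched as follows. The sketch assumes that Equation 10 amounts to taking the smallest integer number of subfilters such that each one holds at most as many elements as the stored set, that is, the ceiling of 1 plus the ratio of residual to stored elements; this reading reproduces the example in which a ratio of 3.6 gives five subfilters and two added bits give two. The function and parameter names are hypothetical.

```python
import math


def subfilters_needed(s: int, n: int, f: int, b: int = 0) -> int:
    """Assumed form of Equation 10: subfilters needed so each holds at most s elements."""
    residual = n * 2.0 ** -(f + b)   # expected false positives of the first filter
    return math.ceil((s + residual) / s)


def select_if_parameters(s: int, n: int, f: int):
    """Follow the listed steps: add bits to the first filter until two subfilters suffice."""
    b = 0
    k = subfilters_needed(s, n, f, b)
    while k > 2:
        b += 1
        k = subfilters_needed(s, n, f, b)
    return b, k
```

With 1000 stored elements, f = 8 and 921600 protected negatives, the residual ratio is 3.6, so the first step gives five subfilters and the loop ends with two added bits and two subfilters, four extra bits in total.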

3.7 Analysis

Let us now compare the different constructions of filters with a false positive free set (FPFS) presented in the previous subsections. We consider a set of size one million elements, , and different sizes of the FPFS set and compute the memory required by the three constructions using the equations derived in the previous subsections. The results are shown in Figure 5. It can be seen that the naive construction requires much more memory than the others as the size of increases, confirming that it is not an efficient construction unless is very small. The same reasoning applies to a Bloomier based construction, and therefore neither of the two is considered further in the rest of the paper.

Figure 5: Memory bits per element in to implement the different FPFS filters for different sizes of when and for of size one million elements.

To better compare the other two constructions, their results are shown in Figure 6 for and in Figure 7 for , both of which also show lower bound (1). When , TF is the most efficient construction in terms of memory, while IF with is more effective when is small and IF with is better when is large. As the size of increases, the memory required also increases and more bits are added to the first filter. When , the use of memory is similar in most cases (except when is very large for the IF with ) as the size of is very small (recall that ). Finally, it can be seen that the proposed filters are reasonably close to the lower bound, with the difference mostly due to the 23% memory overhead introduced by xor filters [18].

An interesting observation is that the IF construction offers an efficient alternative for all values of by selecting the appropriate value of . Therefore, the integrated construction can be used to implement FPFS filter with a small memory footprint and lookups that require the same number of memory accesses as the original xor probing filter from which they are derived.

Figure 6: Memory bits per element in required to implement the TF and IF FPFS filters for different sizes of when and for of size one million elements.
Figure 7: Memory bits per element in required to implement the TF and IF FPFS filters for different sizes of when and for of size one million elements.

3.8 Potential Optimizations

In this section we discuss two potential optimizations for the proposed filter constructions. The evaluation and further analysis of these optimizations are left as future work and not considered further in this paper.

3.8.1 Hybrid of TF and IF

We observe that a hybrid of the TF and IF approaches could achieve the memory footprint of TF with the speed of IF for a portion of queries. This could be useful in speeding up positive queries when TF space efficiency is desired. Essentially, the hybrid would not round up in computing (Equation 10), would separate off and size the subfilter for only elements, and would make a weighted hash, such that for approximately random elements of and for the approximately remaining elements.

3.8.2 Generalized IF

We observe that when minimal rounding is involved in choosing (Equation 10), IF is as memory efficient as TF, as in Figure 6 near . With rounding, only elements are stored where there is space for elements. If an independently hashed subset of elements map to both subfilters, this space can be filled for the purpose of lowering the false positive rate of elements not in . Note that to guarantee false positive freedom, elements in only need to be added to one subfilter even if mapping to both. The approach of mixing the matching of and bits is known to incur only small space efficiency overheads compared to an optimal approach with truly fractional [12]. However, TF and generalized IF do not compare so directly because they offer different choices for false positive rates, except when the structure is standard IF with the same memory efficiency as TF because is a power of two.

4 Evaluation

The proposed filters with a FPFS from Sections 3.5 and 3.6 have been implemented using the Java implementation of the xor filter [18] as the base filter (the code is available at https://github.com/amacian/fastfilterfpfs). Several experiments were then conducted to validate the filters and measure their performance. All simulations were run on a Windows 10 machine with an Intel(R) Core(TM) i7-10700 CPU running at 2.90GHz.

The first experiment focuses on validating the filter functionality by testing that all elements in return a positive, all elements in return a negative, and the rest have a false positive probability that is approximately . This has been done for a set of one hundred thousand elements and different sizes of and elements when . First, sets and were generated using random elements. Then, the filters were constructed and all elements in were queried; in all cases there were no false positives. Therefore, the proposed filters provide the false positive free set feature. The memory usage and false positive rate for elements not in were also measured.
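The validation procedure can be sketched with a toy filter. The code below is not the TF or IF construction and does not use xor filters: as a deliberately simplified stand-in, the first stage is a set of hashed slots with a false positive rate of about 2^-f, and the second stage is an exact list of the protected negatives that pass the first stage. The class name, key names, set sizes and the 8-bit width are all hypothetical. It only illustrates the three checks: no false negatives, no false positives on the protected set, and the expected rate elsewhere.

```python
import hashlib


def hashed_slot(key: str, universe: int) -> int:
    """Map a key to a pseudo-random slot in [0, universe)."""
    digest = hashlib.blake2b(key.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") % universe


class ToyFpfsFilter:
    """Toy stand-in: approximate first stage plus an exact exclusion list."""

    def __init__(self, positives, protected, fingerprint_bits=8):
        # The positive and protected sets are assumed to be disjoint.
        self.universe = max(1, len(positives)) << fingerprint_bits
        self.stage1 = {hashed_slot(k, self.universe) for k in positives}
        # Protected negatives that would be false positives in stage 1.
        self.blocked = {k for k in protected
                        if hashed_slot(k, self.universe) in self.stage1}

    def contains(self, key: str) -> bool:
        in_stage1 = hashed_slot(key, self.universe) in self.stage1
        return in_stage1 and key not in self.blocked


positives = [f"pos-{i}" for i in range(2_000)]
protected = [f"neg-{i}" for i in range(20_000)]
others = [f"other-{i}" for i in range(200_000)]

flt = ToyFpfsFilter(positives, protected, fingerprint_bits=8)

false_negatives = sum(not flt.contains(k) for k in positives)
protected_false_positives = sum(flt.contains(k) for k in protected)
other_rate = sum(flt.contains(k) for k in others) / len(others)

print(false_negatives, protected_false_positives, f"{other_rate:.4f}")
```

The exact exclusion list is what makes this stand-in memory-hungry for large protected sets; the sketch is only meant to show what the experiment verifies, not how the proposed filters achieve it compactly.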

The results for memory usage are shown in Figure 8, which includes the theoretical estimates presented in the previous section. The values for an Xor filter are similar to those of the TF scheme when is small, which corresponds to the leftmost points on the plots. It can be seen that the simulation results match the theoretical analysis. The TF construction is the most efficient in terms of memory usage, as expected, but the IF construction can also be implemented with a reasonable amount of memory by appropriately selecting the value of . The results for the naive construction are not shown, as it requires much more memory, making the filters impractical when the number of elements in is large.

Figure 8: Memory required to implement the TF and IF FPFS filters for different sizes of when (left), (middle), (right) and for of size one hundred thousand elements.

The parameter selected for each configuration was also logged and the results are shown in Figure 9. It can be seen that when , the value of is always zero as the size of is much smaller than that of . Instead, when values larger than zero are selected as the size of grows.

Figure 9: Value of for the TF and IF FPFS filters for different sizes of when (left), (middle), (right) and for of size one hundred thousand elements.

Finally, the false positive probability for negative elements that are not in was measured with ten million random elements. The results are summarized in Figure 10 that also shows the theoretical estimate given by . It can be seen that simulation results match the theoretical analysis.

Figure 10: Fraction of false positives for negative elements not in for the TF and IF FPFS filters for different sizes of when (left), (middle), (right) and for of size one hundred thousand elements.

The second experiment evaluates the speed of the proposed FPFS filter. As the absolute performance values are dependent on the platform used to run the experiments and also on the programming language and tools, we also include the results for the base Xor filter Java implementation [18]. This enables a relative comparison in terms of the overhead introduced by our FPFS filters.

The speed is measured both for the construction of the filter and for lookups. In the first case, the time needed to complete the construction was measured and compared to that of the original filter. The average over 1000 runs was computed and the results are shown in Figure 11. It can be seen that the proposed filters introduce a significant overhead that is around 2-3x when is small and increases with the size of . This overhead is due to several factors. The first one is the need to construct several filters, two for the TF and IF, constructions and in some cases even three for the IF, construction. The construction of the filters is also linked in the case of the IF construction, so that when trying a set of hash functions, construction is successful only when the construction of all the filters succeeds. The second factor is that some of those filters are larger as they store elements from in addition to those in . The third factor is that all elements in have to be checked in the first filter to construct the set . This means that as the size of increases, so does the overhead, as is clearly seen in the results in Figure 11. In any case, the construction time remains on the order of seconds even for large . Additionally, the check for the elements in can easily be run in parallel if the overhead becomes an issue in some application. Indeed, in most applications construction is amortized over many query operations, so its cost is not a major concern.
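The parallel check mentioned above partitions cleanly because each element is tested independently against the already built first filter. A minimal sketch, assuming a hypothetical `first_stage_contains` predicate standing in for the first filter's lookup; it only shows the partitioning, and any actual speedup depends on the runtime and on how expensive a lookup is.

```python
from concurrent.futures import ThreadPoolExecutor


def first_stage_false_positives(first_stage_contains, negatives, workers=4):
    """Return the negatives that the first filter reports as positives."""
    chunk = max(1, -(-len(negatives) // workers))  # ceiling division
    chunks = [negatives[i:i + chunk] for i in range(0, len(negatives), chunk)]

    def scan(part):
        return [key for key in part if first_stage_contains(key)]

    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(scan, chunks))  # order of chunks is preserved
    return [key for part in parts for key in part]


def toy_contains(key):
    """Toy predicate: reports a positive for one key in every 256."""
    return key % 256 == 0


negatives = list(range(10_000))
collisions = first_stage_false_positives(toy_contains, negatives)
print(len(collisions))
```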

Figure 11: Average time needed to construct the filter when (left), (middle), (right) and for of size one hundred thousand elements.

To evaluate the speed of lookups we measure the speed for positives and negatives. Positive lookups require checking all the filters, while most negatives are identified by the first filter, avoiding access to the second, so the overhead should be smaller. For positives, all the elements in were queried and the average time was logged, while for negatives, one million random elements were tested. The results averaged over 1000 runs are shown in Figures 12 and 13, where the results for the original Xor filter are also included for comparison. It can be seen that the proposed filters introduce an overhead for positives. This lookup time is approximately 2x for the TF construction, which needs to access different memory positions for each filter. Instead, for the IF constructions it is much lower, below 1.2x that of the original filter in all cases, thus confirming the benefits of integrating the filters when positive lookup speed is a priority. Finally, speed does not show any clear dependency on the size of or the value of for positive lookups.

Figure 12: Average time needed for a positive lookup when (left), (middle), (right) and for of size one hundred thousand elements.

For negative lookups, the proposed filters also introduce a small overhead. In the case of the TF construction, the overhead is small as the first filter is the same and most lookups (approximately ) only need to check the first filter. The IF constructions also have a small overhead because checking the first filter needs some additional operations to extract the relevant bits that share the word with those of the second filter. In any case, the lookup time is only approximately 1.2x that of the original filter in the worst case. In summary, the proposed filters introduce a significant overhead in the construction time, while for lookups the overhead is smaller and can be minimized for positive or negative lookups by selecting the filter construction.

Figure 13: Average time needed for a negative lookup when (left), (middle), (right) and for of size one hundred thousand elements.

The implementation and evaluation confirm that false positive freedom on a set can be implemented with a moderate overhead in memory and speed when using the xor filter as the base structure. Therefore, for applications in which the cost of a false positive is large, the proposed filters can provide significant benefits by ensuring that the most frequent negative elements do not suffer false positives.

5 Case studies

This section discusses the applicability of the proposed filters by presenting a few use cases and showing the benefits of the proposed filters. The first case study is a simple spell checker, one of the original applications of Bloom filters, which is used to illustrate how the proposed filters work. Then, two more practical case studies, a URL deny list and a cryptocurrency application, are discussed.

5.1 Spell checker

One of the original applications of Bloom filters was to store the words of a dictionary to detect spelling errors [26]. Although this is no longer implemented with Bloom filters, it can serve to illustrate how our FPFS filters work. In this design, the checker fails to detect spelling errors that have a false positive on the filter. Unfortunately, those may occur for frequent spelling mistakes and it would be beneficial to avoid them. In this scenario, our filter with a false positive free set can be used by selecting common spelling errors and adding them to set . To illustrate this use case, we have taken the Birkbeck dataset (available at https://www.dcs.bbk.ac.uk/%7EROGER/corpora.html), which contains 36 133 common misspellings of 6 136 words, and inserted the words in our filter with as set and the misspellings as set . Common misspellings that are duplicated or correspond to valid words were removed, so the total number of misspellings is reduced to 32 894. We verified that there are no false positives for those misspellings and measured the false positive rate for other words. The results show that the proposed filters require 60 951, 63 168 and 68 229 memory bits for the TF, IF with and IF with constructions compared to 60 624 memory bits of a traditional filter. The memory overhead is negligible (0.5%) for the TF construction and very small (4.2%) for the IF with . The proposed filters achieve a similar false positive probability for negative elements not in as that of the original filter while eliminating all false positives in . Instead, the original filter suffers on average 130.5 false positives on , based on one thousand sample runs. This illustrates the benefits of the proposed filters in this simple case study.
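The overhead percentages quoted for the spell checker follow directly from the reported bit counts, as the short calculation below shows. The labels for the two IF configurations are placeholders, since their parameter values are not reproduced here; the empirical false positive rate of the traditional filter on the misspellings is derived from the reported average of 130.5 false positives over 32 894 misspellings.

```python
baseline_bits = 60_624  # traditional filter
fpfs_bits = {
    "TF": 60_951,
    "IF (first configuration)": 63_168,   # placeholder label
    "IF (second configuration)": 68_229,  # placeholder label
}

# Relative memory overhead of each FPFS construction over the baseline.
overheads = {name: bits / baseline_bits - 1 for name, bits in fpfs_bits.items()}
for name, value in overheads.items():
    print(f"{name}: {value:.1%} memory overhead")

# Fraction of misspellings accepted by the traditional filter on average.
baseline_fp_rate = 130.5 / 32_894
print(f"traditional filter: {baseline_fp_rate:.2%} of misspellings accepted")
```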

5.2 URL deny list

Another scenario where our filters can be useful is to implement a deny list of malicious URLs. A filter that stores the list of malicious URLs can be used to perform an initial check so that, on a negative, the URL can be safely classified as non-malicious. On a positive, instead, a full check is needed to ensure that the URL is not a false positive. Here again, there are many valid URLs that are frequently used, and ensuring that they do not suffer false positives would be beneficial. As an example, we have considered the URL dataset used in [8] that has 485 730 unique URLs of which 16.47% are malicious, and the rest are benign. We store the malicious URLs in a filter using the benign URLs as set . In this case, we set and the proposed filters require 656 328, 679 584 and 734 697 memory bits for the TF, IF with and IF with constructions compared to 653 040 memory bits of a traditional filter. Again, the memory overhead is negligible (0.5%) for the TF construction and very small (4.1%) for the IF with . We verified that there are no false positives for the set in our filters while the traditional one suffers 1 358 false positives on average, based on one thousand constructions. This means that, again, we are able to avoid false positives on frequently used URLs. There are other similar use cases of access/deny lists where our filters could be useful, such as for IP addresses or person names (e.g. no-fly list).
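The two-step check described for the deny list can be sketched as follows. The function and variable names are hypothetical, and the filter is modelled by a plain Python set with one deliberately added false positive rather than by an xor filter; the point is only that a filter negative ends the lookup, while a filter positive triggers the expensive full check.

```python
def classify_url(url, filter_contains, full_check):
    """Classify a URL using a filter as a cheap first step."""
    if not filter_contains(url):
        return "benign"  # filters have no false negatives
    return "malicious" if full_check(url) else "benign"


deny_list = {"http://bad.example/a", "http://bad.example/b"}
# Toy filter: the exact deny list plus one deliberate false positive.
toy_filter = deny_list | {"http://ok.example/fp"}
full_checks = []


def full_check(url):
    """Exact (and, in practice, expensive) lookup; records each call."""
    full_checks.append(url)
    return url in deny_list


queries = ["http://bad.example/a", "http://ok.example/fp", "http://ok.example/x"]
results = {u: classify_url(u, toy_filter.__contains__, full_check) for u in queries}
print(results, len(full_checks))
```

In this toy run the false positive costs one unnecessary full check; placing frequently visited benign URLs in the false positive free set is what removes that cost for them.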

5.3 Bitcoin Simplified Payment Verification (SPVs) Nodes

Bitcoin defines two modes of operation for the nodes: full nodes and Simplified Payment Verification nodes (SPVs) (see https://developer.bitcoin.org/devguide/operating%5Fmodes.html). The full nodes store all the blocks in the Bitcoin blockchain (which currently requires more than 300 GB) while the SPV nodes only store the block headers. This reduces the storage and bandwidth needed by SPV nodes. Since SPV nodes do not have the full data of the transactions, they need to get it from the full nodes when needed. Originally, when an SPV node needed to retrieve a transaction, it downloaded blocks until the transaction was found, which was very inefficient. To reduce the overhead, a first solution was implemented in which the SPV nodes construct a Bloom filter with the identifiers of the transactions that they want to get and send it to the full nodes. Then, the full nodes applied the filter to each block, and whenever a positive was returned by the filter for any of the transaction identifiers in the block, the block was sent to the SPV node that issued the request (see https://github.com/bitcoin/bips/blob/master/bip-0037.mediawiki). However, this scheme had privacy issues, as full nodes can infer information from the SPV nodes [16], and the filters can also be used to launch denial of service attacks. This led to a new design in which full nodes add a filter to each of the header blocks so that SPV nodes can use those filters to locate the blocks that contain the transactions they are interested in. Once the blocks are identified, the SPV node can download only those blocks from the full nodes (see https://github.com/bitcoin/bips/blob/master/bip-0157.mediawiki). In this second version, Golomb Code Sets [17] are used instead of Bloom filters for the approximate membership checking (see https://github.com/bitcoin/bips/blob/master/bip-0158.mediawiki).
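The second design, in which the SPV node scans per-block filters and downloads only the matching blocks, can be illustrated with a small sketch. This is not the BIP 157/158 wire protocol: each block filter is modelled as a hypothetical membership predicate over transaction identifiers, and all names and identifiers are made up. One predicate is given a deliberate false positive to show its cost.

```python
def blocks_to_download(block_filters, wanted_txids):
    """Return the blocks whose filter matches any wanted transaction."""
    return [block_hash
            for block_hash, contains in block_filters.items()
            if any(contains(txid) for txid in wanted_txids)]


block_filters = {
    "block-1": {"tx-a", "tx-b"}.__contains__,
    "block-2": {"tx-c"}.__contains__,
    # block-3 only holds tx-d; "tx-x" models a false positive of its filter.
    "block-3": {"tx-d", "tx-x"}.__contains__,
}

selected = blocks_to_download(block_filters, ["tx-c", "tx-x"])
print(selected)
```

Here `block-3` is downloaded needlessly because of the false positive; a filter whose false positive free set covers the queried transactions would avoid that download.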

The number of transactions per month in Bitcoin is approximately 20 million and, as of today, the number of existing transactions is around 700 million (see https://www.blockchain.com/charts/n-transactions-total). Therefore, one possibility could be to replace Golomb Code Sets with our filter with a false positive free set, using the universe of existing transactions as the set free of false positives or, if that is not possible, avoiding false positives for, say, the transactions of the last few days, which are more likely to be checked by the SPV nodes. As the number of transactions per block is around 2500, we have constructed filters with of size 2500 and of sizes 700 million and 20 million to implement a FPFS on all existing transactions and on the transactions of the last 30 days, respectively. When we set , the FPFS filters require in the first case (700 million) 62293 bits for the TF construction and 76245 and 65268 bits for the IF1 and IF2 constructions, and in the second case (20 million) 46363 bits for the TF construction and 59262 and 46620 bits for the IF1 and IF2 constructions. This shows that the proposed filters can be used to completely avoid false positives on Bitcoin SPV nodes for all existing transactions or for the transactions done in the last 30 days.
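To put the reported filter sizes in perspective, they can be converted to bits per stored transaction using the block size of 2500 transactions stated above. The calculation below uses only the figures from the text; the dictionary keys are informal labels.

```python
transactions_per_block = 2_500

# Reported filter sizes in bits, keyed by (protected set, construction).
reported_bits = {
    ("700 million", "TF"): 62_293,
    ("700 million", "IF1"): 76_245,
    ("700 million", "IF2"): 65_268,
    ("20 million", "TF"): 46_363,
    ("20 million", "IF1"): 59_262,
    ("20 million", "IF2"): 46_620,
}

bits_per_tx = {key: bits / transactions_per_block
               for key, bits in reported_bits.items()}

for (protected, construction), value in bits_per_tx.items():
    print(f"{protected:>11} protected, {construction}: {value:.2f} bits per transaction")
```

All six configurations stay between roughly 18.5 and 30.5 bits per stored transaction, even though the protected set is several orders of magnitude larger than the stored set.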

6 Conclusions and Future Work

In this paper, a new type of approximate membership query filters that completely eliminate false positives for a given set has been proposed. Several constructions based on xor filters have been presented and evaluated for these filters with a false positive free set showing that it is indeed possible to avoid false positives for large sets. The benefits of the filters with a false positive free set have also been illustrated with three practical case studies covering different applications in computing and networking.

The constructions presented in this paper are also directly applicable to the ribbon filter [12], which offers more design flexibility than standard xor filters. Therefore, implementing the proposed filter with a false positive free set using ribbon filters is an interesting topic to continue this work. Similarly, incorporating the proposed filters in Bitcoin SPV nodes and evaluating their benefits over existing filters would also be of interest. More broadly, we expect the proposed filters to be beneficial in many other scenarios and thus identifying those and evaluating the use of the proposed filters on them could lead to significant performance improvements.

7 Acknowledgements

We would like to thank Mario Palomares for the discussions on the Bitcoin SPV case study. Pedro Reviriego would like to acknowledge the support of the ACHILLES project PID2019-104207RB-I00 and the Go2Edge network RED2018-102585-T funded by the Spanish Agencia Estatal de Investigación (AEI) 10.13039/501100011033 and of the Madrid Community research project TAPIR-CM grant no. P2018/TCS-4496. Stefan Walzer's work was supported by the DFG grant WA 5025/1-1.

References

  • [1] M. Akiyama, T. Yagi, and T. Hariu (2013) Improved blacklisting: inspecting the structural neighborhood of malicious urls. IT Professional 15 (4), pp. 50–56.
  • [2] D. Belazzougui and R. Venturini (2013) Compressed static functions with applications. In Proceedings of the Twenty-Fourth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’13, USA, pp. 229–240. ISBN 9781611972511.
  • [3] M. A. Bender, M. Farach-Colton, M. Goswami, R. Johnson, S. McCauley, and S. Singh (2018) Bloom filters, adaptivity, and the dictionary problem. In IEEE Symposium on Foundations of Computer Science (FOCS).
  • [4] B. H. Bloom (1970) Space/time trade-offs in hash coding with allowable errors. Commun. ACM 13 (7), pp. 422–426.
  • [5] L. Carter, R. Floyd, J. Gill, G. Markowsky, and M. Wegman (1978) Exact and approximate membership testers. In Proceedings of the Tenth Annual ACM Symposium on Theory of Computing, STOC ’78, New York, NY, USA, pp. 59–65. ISBN 9781450374378.
  • [6] B. Chazelle, J. Kilian, R. Rubinfeld, and A. Tal (2004) The bloomier filter: an efficient data structure for static support lookup tables. In Proceedings of the fifteenth annual ACM-SIAM symposium on Discrete algorithms, pp. 30–39.
  • [7] J. G. Cleary (1984-09) Compact hash tables using bidirectional linear probing. IEEE Trans. Comput. 33 (9), pp. 828–834. ISSN 0018-9340.
  • [8] Z. Dai and A. Shrivastava (2019) Adaptive learned bloom filter (ada-bf): efficient utilization of the classifier. arXiv:1910.09131.
  • [9] N. Dayan, M. Athanassoulis, and S. Idreos (2018-12) Optimal bloom filters and adaptive merging for lsm-trees. ACM Trans. Database Syst. 43 (4). ISSN 0362-5915.
  • [10] K. Deeds, B. Hentschel, and S. Idreos (2020-12) Stacked filters: learning to filter by structure. Proc. VLDB Endow. 14 (4), pp. 600–612. ISSN 2150-8097.
  • [11] P. C. Dillinger and P. Manolios (2009) Fast, all-purpose state storage. In Proceedings of the 16th International SPIN Workshop on Model Checking Software, Berlin, Heidelberg, pp. 12–31. ISBN 9783642026515.
  • [12] P. C. Dillinger and S. Walzer (2021) Ribbon filter: practically smaller than bloom and xor. CoRR abs/2103.02515.
  • [13] B. Donnet, B. Baynat, and T. Friedman (2006) Retouched bloom filters: allowing networked applications to trade off selected false positives against false negatives. In Proceedings of the 2006 ACM CoNEXT Conference, CoNEXT ’06, New York, NY, USA. ISBN 1595934561.
  • [14] B. Fan, D. G. Andersen, M. Kaminsky, and M. D. Mitzenmacher (2014) Cuckoo filter: practically better than bloom. In ACM CoNEXT.
  • [15] M. Genuzio, G. Ottaviano, and S. Vigna (2020) Fast scalable construction of ([compressed] static | minimal perfect hash) functions. Information and Computation 273, pp. 104517. Note: DCC (Data Compression Conference) 2018. ISSN 0890-5401.
  • [16] A. Gervais, S. Capkun, G. O. Karame, and D. Gruber (2014) On the privacy provisions of bloom filters in lightweight bitcoin clients. In Proceedings of the 30th Annual Computer Security Applications Conference, pp. 326–335.
  • [17] S. Golomb (1966) Run-length encodings (corresp.). IEEE Transactions on Information Theory 12 (3), pp. 399–401.
  • [18] T. M. Graf and D. Lemire (2020-03) Xor filters: faster and smaller than bloom and cuckoo filters. ACM J. Exp. Algorithmics 25. ISSN 1084-6654.
  • [19] J. B. Hreinsson, M. Krøyer, and R. Pagh (2009) Storing a compressed function with constant time access. In Proc. 17th ESA, pp. 730–741.
  • [20] S. Z. Kiss, É. Hosszu, J. Tapolcai, L. Rónyai, and O. Rottenstreich (2021) Bloom filter with a false positive free zone. IEEE Transactions on Network and Service Management, pp. 1–1.
  • [21] T. Kopelowitz, S. McCauley, and E. Porat (2021) Support optimality and adaptive cuckoo filters. In Proc. 17th WADS, Lecture Notes in Computer Science, Vol. 12808, pp. 556–570.
  • [22] H. Lang, T. Neumann, A. Kemper, and P. Boncz (2019) Performance-optimal filtering: bloom overtakes cuckoo at high throughput. Proc. VLDB Endow. 12 (5), pp. 502–515.
  • [23] L. Luo, D. Guo, R. T. B. Ma, O. Rottenstreich, and X. Luo (2019) Optimizing bloom filter: challenges, solutions, and comparisons. IEEE Communications Surveys Tutorials 21 (2), pp. 1912–1949.
  • [24] B. Majewski, N. Wormald, G. Havas, and Z. Czech (1996-01) A family of perfect hashing methods. Comput. J. 39, pp. 547–554.
  • [25] U. Manber and S. Wu (1994) An algorithm for approximate membership checking with application to password security. Inf. Process. Lett. 50 (4), pp. 191–197.
  • [26] M. McIlroy (1982) Development of a spelling list. IEEE Transactions on Communications 30 (1), pp. 91–99.
  • [27] K. Mehlhorn (1984) Data structures and algorithms 1: sorting and searching. Springer.
  • [28] M. Mitzenmacher, S. Pontarelli, and P. Reviriego (2020) Adaptive cuckoo filters. ACM J. Exp. Algorithmics 25.
  • [29] S. Pontarelli, P. Reviriego, and M. Mitzenmacher (2018) EMOMA: exact match in one memory access. IEEE Transactions on Knowledge and Data Engineering (TKDE) 30 (11), pp. 2120–2133.
  • [30] P. Reviriego, J. Martínez, D. Larrabeiti, and S. Pontarelli (2021) Cuckoo filters and bloom filters: comparison and application to packet classification. IEEE Transactions on Network and Service Management.
  • [31] O. Rottenstreich, P. Reviriego, E. Porat, and S. Muthukrishnan (2021) Avoiding flow size overestimation in the count-min sketch with bloom filter constructions. IEEE Transactions on Network and Service Management, pp. 1–1.
  • [32] S. Tarkoma, C. E. Rothenberg, and E. Lagerspetz (2012) Theory and practice of bloom filters for distributed systems. IEEE Communications Surveys Tutorials 14 (1), pp. 131–155.
  • [33] S. Walzer (2021) Peeling close to the orientability threshold: spatial coupling in hashing-based data structures. In Proc. 32nd SODA.