## Introduction to Bloom filters

A Bloom filter is a probabilistic data structure used to answer set membership queries in constant time. Queries may return false positives, but never a false negative (which makes the query a false-biased Monte Carlo algorithm). Standard Bloom filters support insertion and set membership queries, but lack element removal, iteration, and other features found in binary search trees, hash tables, and other common data structures that can be used similarly.

### Construction of a standard Bloom filter

A Bloom filter represents a set of items, and consists of $k$ distinct hash functions and an array of $m$ bits. To add an element to the Bloom filter, compute its hash with each of the functions and set the bit at each resulting index. (If the bit has been set previously, keep it set.) To query an element, compute its hashes and return “true” if all of the corresponding bits are set; return “false” otherwise. Figure 1 demonstrates a query operation.

An element not in the set could exist such that all of its
corresponding bits have been set by other elements. Querying would
return “true”: a *false positive*. A query will never mistakenly
report that an element is not a member of the set when it actually
is—in other words, there are no *false negatives*.
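The construction above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the class name `BloomFilter`, the use of salted SHA-256 digests as the family of hash functions, and the parameter values are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Minimal standard Bloom filter: m bits and k salted hash functions."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.bits = bytearray((m + 7) // 8)

    def _indexes(self, item: str):
        # Assumption: k hash functions are simulated by salting SHA-256.
        for salt in range(self.k):
            digest = hashlib.sha256(f"{salt}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        # Set the bit at every hashed index (already-set bits stay set).
        for i in self._indexes(item):
            self.bits[i // 8] |= 1 << (i % 8)

    def __contains__(self, item: str) -> bool:
        # "True" only if every hashed bit is set.
        return all(self.bits[i // 8] & (1 << (i % 8)) for i in self._indexes(item))


bf = BloomFilter(m=1024, k=3)
for word in ("apple", "banana", "cherry"):
    bf.add(word)

print("banana" in bf)  # True: inserted elements are always reported
print("durian" in bf)  # usually False, but a false positive is possible
```

Because bits are only ever set and never cleared, an inserted element can never be reported as absent, which is the no-false-negatives property described above.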

### False positives

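The false positive rate of Bloom filters can be estimated [1] using

$$\mathrm{FPR} = \left(1 - \left(1 - \frac{1}{m}\right)^{kn}\right)^{k} \approx \left(1 - e^{-kn/m}\right)^{k} \tag{1}$$

where $m$ is the size of the Bloom filter in bits, $k$ is the number of hash functions, and $n$ is the number of elements inserted. The approximate form is graphed in Figure 2.

To fix the false positive rate at FPR, the optimal number of hash functions to use [2] is

$$k = \log_2 \frac{1}{\mathrm{FPR}} \tag{2}$$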
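As a rough numerical check of these estimates, the sketch below evaluates the usual textbook forms: the exact expression $(1 - (1 - 1/m)^{kn})^k$, its approximation $(1 - e^{-kn/m})^k$, and the choice $k = \log_2(1/\mathrm{FPR})$. The function names and the example sizes are assumptions chosen for illustration.

```python
import math


def fpr_exact(m: int, k: int, n: int) -> float:
    """False positive rate for m bits, k hash functions, n inserted elements."""
    return (1 - (1 - 1 / m) ** (k * n)) ** k


def fpr_approx(m: int, k: int, n: int) -> float:
    """Exponential approximation of the same rate."""
    return (1 - math.exp(-k * n / m)) ** k


def optimal_k(target_fpr: float) -> int:
    """Number of hash functions for a target rate, rounded to an integer."""
    return max(1, round(-math.log2(target_fpr)))


m, n = 10_000, 1_000  # assumed sizes: 10 bits per element
k = optimal_k(0.01)
print(k)
print(fpr_exact(m, k, n), fpr_approx(m, k, n))
```

With ten bits per element the two forms agree closely, which is why the approximate form is normally used when sizing a filter.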

### Example use case

Bloom filters are often used to reduce the memory or storage requirements of
set membership data structures. One example is a browser vendor protecting
users from malicious websites: the vendor could ship a list of known malicious
URLs, but shipping the full list would require an excessively large download
and too much storage space. Instead, the vendor could include in the browser
package a Bloom filter containing that list, and browsers could query the local
Bloom filter quickly before loading any webpage. If the query returns “true”,
then the webpage can be checked via the Internet against the exact list. If the
query returns “false”, then the webpage is not on the list, because a query for
an element actually on the list will never return “false”. (In fact, this is
how Google Chrome protects users from malicious sites. As of 2010, the full
list contained ~1 million websites, stored in a Bloom filter of only 18 Mb
[3] [4].)

### Theoretic advantages

The primary advantage of Bloom filters is that they markedly beat other data structures used for set membership queries in terms of memory overhead. Naive and deterministic implementations of set membership data structures (including binary search trees and hash tables) generally have to store every element they represent in full. Bloom filters are a significant improvement: they represent an element with at most one bit per hash function. (Typically, fewer than 10 bits are required to represent an element in a Bloom filter [5].) Furthermore, the number of elements in the filter does not need to match the size of the bit array: the array is sized up front and does not grow as elements are inserted.

There are some extensions to the common Bloom filter that make this advantage even more extreme (notably the Fuzzy-Folded Bloom filter, which can support ~1.9 times the elements of a standard Bloom filter while maintaining the same false positive rate and a constant time complexity in all operations).

A secondary advantage of Bloom filters is their ability to insert and query elements in constant time. Binary search trees can do neither in constant time. Hash tables can often only query elements in amortized constant time, and generally have a linear worst case for insertion. The latency of Bloom filter operations depends solely on the number and complexity of the hash functions the filter employs.

Because of Bloom filters’ ability to quickly process elements with low space cost, they are particularly useful in problems that involve data streams [6] (or massive sets of elements with no defined upper bounds on size). For example, a massive hash table that is used to test set membership will eventually fill up and break when a boundless number of elements are inserted. While the load factor of a Bloom filter in the same scenario may become dangerously high, the core functionality of the data structure will always remain intact, even when inserting an extremely large number of elements.

### Limitations

The probabilistic nature of Bloom filters may make them unsuitable for certain tasks, though that can be mitigated by choosing an appropriate size of the bit array and number of hash functions to achieve a sufficiently low false positive rate. By Equation 1, one can increase the size of the bit array and adjust the number of hash functions (depending on the number of inserted elements) to decrease the false positive rate to an acceptable percentage. To eliminate false positives entirely, the Bloom filter can be used as a preliminary check that rules out negatives before the remaining candidates are checked against the exact list. (This is faster than checking everything against the exact list, since Bloom filter queries run in constant time and Bloom filters are small enough to fit in faster, limited-quantity memory such as cache.)

The inability to remove elements from a Bloom filter makes the structure unfit for highly dynamic and volatile sets, where the membership of elements changes rapidly. (Even re-inserting elements into a new Bloom filter doesn’t work, since a Bloom filter cannot efficiently and precisely report the entire set; the hashing of elements and subsequent setting of bits is an irreversible operation.) However, there are several extensions on the common Bloom filter which add removal support—these will be discussed in the next section.

## Developments

In this section, we will discuss a few iterative improvements on the standard Bloom filter, as well as the dynamic Fuzzy-folded Bloom filter and the “practically better” Cuckoo filter. These improvements include addition of a removal operation, superior space complexity, and better hardware interaction in the interest of practicality. While all of these solve the same general sorts of problems, one may be better than another for specific use cases. Table 1 compares several properties of the discussed filters.

### Counting Bloom filter

The most primitive Bloom filter iteration supporting element removal is the Counting Bloom filter. In this variation, each bit is replaced by a small counter of several bits (usually four). When an element is inserted, the counter at each index it hashes to is incremented. When an element is removed, all the counters it hashes to are decremented. A query returns “true” if all the counters the element hashes to are greater than 0, and “false” otherwise. Disadvantages of this variation relative to the standard Bloom filter include the possibility of counter overflow and higher memory overhead (as it requires storing several bits for each array index instead of only one).
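The counter-based scheme can be sketched as follows. This is an illustrative sketch under stated assumptions: the class name, the salted SHA-256 hashing, and the policy of saturating four-bit counters at 15 (rather than letting them overflow) are choices made for the example, not details taken from a particular implementation.

```python
import hashlib


class CountingBloomFilter:
    MAX_COUNT = 15  # assumption: four-bit counters that saturate at 15

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.counters = [0] * m

    def _indexes(self, item: str):
        for salt in range(self.k):
            digest = hashlib.sha256(f"{salt}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        for i in self._indexes(item):
            # Saturate instead of overflowing the small counter.
            if self.counters[i] < self.MAX_COUNT:
                self.counters[i] += 1

    def remove(self, item: str) -> None:
        # Removing an element that was never added could corrupt the filter,
        # so refuse when the query already says it is absent.
        if item not in self:
            raise KeyError(item)
        for i in self._indexes(item):
            # Saturated counters are left alone: their true count is unknown.
            if 0 < self.counters[i] < self.MAX_COUNT:
                self.counters[i] -= 1

    def __contains__(self, item: str) -> bool:
        return all(self.counters[i] > 0 for i in self._indexes(item))


cbf = CountingBloomFilter(m=256, k=3)
cbf.add("apple")
cbf.add("banana")
cbf.remove("apple")
print("banana" in cbf)  # True: removing "apple" only undoes its own increments
print("apple" in cbf)   # usually False after removal
```

Note that the membership check in `remove` cannot catch every misuse: removing an element that is only a false positive would still decrement counters belonging to other elements.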

### Blocked Bloom filter

Blocked Bloom filters (like standard Bloom filters) do not support removal. However, they are constructed to exhibit high spatial locality by fitting several smaller Bloom filters individually into cache lines. They guarantee fewer than two cache misses on negative queries [7], whereas standard Bloom filters can incur up to $k$ misses (one per hash function) on negative queries. Thus, a Blocked Bloom filter can be a much more practical data structure than a general Bloom filter. These improvements come at a cost, however: implementing a Blocked Bloom filter requires some knowledge of the hardware.
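A sketch of the blocking idea follows: one hash selects a block, and all remaining hashes set bits inside that block only. The 512-bit block size (one 64-byte cache line), the class name, and the salted SHA-256 hashing are assumptions for illustration. Python integers do not model cache behaviour, so the example shows only the indexing logic.

```python
import hashlib

BLOCK_BITS = 512  # assumption: one block per 64-byte cache line


class BlockedBloomFilter:
    def __init__(self, num_blocks: int, k: int) -> None:
        self.num_blocks = num_blocks
        self.k = k
        self.blocks = [0] * num_blocks  # each int stands in for a 512-bit block

    @staticmethod
    def _hash(salt, item: str) -> int:
        digest = hashlib.sha256(f"{salt}:{item}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    def _locate(self, item: str):
        # The first hash picks the block; the other k hashes pick bits within it.
        block = self._hash("block", item) % self.num_blocks
        mask = 0
        for salt in range(self.k):
            mask |= 1 << (self._hash(salt, item) % BLOCK_BITS)
        return block, mask

    def add(self, item: str) -> None:
        block, mask = self._locate(item)
        self.blocks[block] |= mask

    def __contains__(self, item: str) -> bool:
        block, mask = self._locate(item)
        # Only one block is ever touched per query.
        return (self.blocks[block] & mask) == mask


bbf = BlockedBloomFilter(num_blocks=16, k=4)
for word in ("apple", "banana", "cherry"):
    bbf.add(word)
print("apple" in bbf)  # True
```

Every query reads a single block, which is what keeps the number of cache misses per lookup small in a real implementation.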

### $d$-left Counting Bloom filter

$d$-left Counting Bloom filters use fingerprinting to support removal. They insert elements by computing their $d$-left hash to store as a fingerprint; removal is accomplished simply by deleting the computed fingerprint. In this way, they are arguably more similar to a standard hash table than to a Bloom filter. This structure’s space cost is up to twice that of a standard Bloom filter, but it requires half the space of a Counting Bloom filter [5]. The idea of fingerprinting incorporated in this structure is also fundamental to Quotient and Cuckoo filters.

### Quotient filter

Quotient filters are another extension of the standard Bloom filter which also use string fingerprinting to support removal. The main difference between Quotient filters and $d$-left Counting Bloom filters is that Quotient filters use an extension of standard hashing called “quotienting” to efficiently hash and reconstruct the fingerprints of elements. In quotienting, a computed fingerprint is partitioned into its high and low bits; the low bits are then stored in a bucket indexed by the high bits. Due to this scheme, collisions generally display high spatial locality and are therefore hardware-friendly [8]. However, this structure still relies on linear probing to resolve collisions. Thus, Quotient filters cannot give the same guarantees on asymptotic time complexity as most other Bloom filter implementations, and their performance suffers at high load [2].
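The quotienting step itself is a simple bit split, sketched below. The function names and the choice of a 4-bit remainder are assumptions for the example; the linear probing and metadata bits that a full Quotient filter needs are deliberately left out.

```python
def quotient_and_remainder(fingerprint: int, remainder_bits: int):
    """Split a fingerprint into high bits (bucket index) and low bits (stored value)."""
    quotient = fingerprint >> remainder_bits
    remainder = fingerprint & ((1 << remainder_bits) - 1)
    return quotient, remainder


def reconstruct(quotient: int, remainder: int, remainder_bits: int) -> int:
    """Rebuild the full fingerprint from a bucket index and its stored remainder."""
    return (quotient << remainder_bits) | remainder


fingerprint = 0b10110110
q, r = quotient_and_remainder(fingerprint, remainder_bits=4)
print(q, r)  # 11 6: the remainder 0b0110 is stored in bucket 0b1011
print(reconstruct(q, r, remainder_bits=4) == fingerprint)  # True
```

Since the bucket index encodes the high bits, only the low bits need to be stored, and the full fingerprint can still be recovered when entries are moved or deleted.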

Table 1: Properties of the discussed filters.

Filter type | Space cost | Cache misses per lookup | Deletion support |
---|---|---|---|
General Bloom | 1 | | No |
Counting Bloom | | | Yes |
Blocked Bloom | 1 | | No |
$d$-left Counting Bloom | | | Yes |
Quotient | | | Yes |
Fuzzy-folded Bloom | | | No |
Cuckoo | | 2 | Yes |

### Fuzzy-folded Bloom filter

A Fuzzy-folded Bloom filter describes the continuous compression (“folding”) of two equally sized standard Bloom filters (created from a bipartition of the original array) into a single, compressed filter. Exactly half of the space of the original array is allocated to this compressed Bloom filter, and the other half is used to support two new, equally sized Bloom filters [6].

In this, the Fuzzy-folded Bloom filter is a “dynamic” Bloom filter, growing to maintain a low false positive rate while accommodating further insertions. The fuzzy-folding operation does not break the invariants of the original filter (in that it will never introduce the possibility of false negatives), nor does it increase the false positive rate of the filter [6].

The fuzzy-folding operation overlays bits at the same position in both Bloom filters and uses fuzzy logic in each bucket of the product array to represent the compression of these filters. This process (and therefore compressed bit representation) is non-commutative; it is necessary to logically reconstruct the ordering to effectively query the filter [6].

Insertion is similar to standard Bloom filter insertion. There are always two non-compressed Bloom filters in the array. There is also a designated load threshold, applicable to both non-compressed filters. If the first filter has yet to reach this threshold, bits are set in the first filter (at the indices determined by the element’s hashes). If the first filter has reached this threshold, bits are set in the second filter. If both filters have reached this threshold, then they are fuzzy-folded, and two new Bloom filters are allocated, with a size that depends on the number of fuzzy-fold operations conducted before this one (counted from 0).

Queries operate by first checking the second non-compressed filter (and returning “true” when a standard Bloom filter would). If there is no match, the first non-compressed filter is then checked. If there is again no match, then the filters are sequentially queried from most recently to least recently compressed. Therefore, in the worst case, the time complexity of queries is linear in terms of the number of fuzzy-folds performed. Also, in all cases, the speed of queries suffers drastically compared to that of a standard Bloom filter when the size of the compressed arrays becomes sufficiently small [6].

The most notable advantage of Fuzzy-folded Bloom filters is that they can accommodate roughly 1.9 times the elements of a general, space-optimized Bloom filter while maintaining the same false positive rate [6]. In practice, this is a desirable trade-off: query time that is linear in the number of fuzzy-folds is not a significant limitation, and is not comparable to time that is linear in the number of inserted elements.

### Iterative Patterns

Bloom filter evolutions are built with practicality in mind: due to the real-world performance boost from cache optimization, many of these filters are constructed for high spatial locality. Likewise, many datasets are volatile, and so fingerprinting is commonly used to let a Bloom filter variant support removal without introducing unacceptable space overhead.

### Cuckoo filter

Cuckoo filters use many of the same paradigms and ideas as Bloom filters and fill the same niche (performing highly space- and time-efficient set membership queries with no false negatives), yet approach a few key concepts in ways that differentiate them from Bloom filters.

Notably, Cuckoo filters support element removal. They accomplish this by using a fingerprinting scheme similar to that seen in $d$-left Counting Bloom filters and Quotient filters, where deleting an element amounts to deleting its fingerprint. Specifically, removing an element from a Cuckoo filter is accomplished by checking both of the element’s possible buckets and deleting the fingerprint if it exists in either [2].

A Cuckoo filter is also more capable of taking advantage of its allocated space. Cuckoo filters’ insertion scheme (illustrated in Figure 3) sets a fingerprint in only one bucket per insertion (compared to one bit per hash function in a Bloom filter) and, in some implementations, allows buckets to contain multiple fingerprints. As a result, a high load factor without a marked increase in false positives is very achievable [2].

The final primary advantage of Cuckoo filters is in their simple design and construction. Compared to more complicated variations on Bloom filters that reduce time and space complexity at comparable margins, such as Blocked Bloom filters and Golomb-Compressed Sequences [2], the concepts and implementation details behind Cuckoo filters are relatively simple.

On the other hand, Cuckoo filter insertion is arguably worse than that of a standard Bloom filter. The process of insertion has the same worst case as general Cuckoo hashing, where all buckets for the hashed fingerprint are occupied, leading to a chain reaction of displacements throughout the entire table. Nevertheless, Cuckoo filters still maintain amortized constant-time insertion [2].

Cuckoo filters also have an upper bound on the number of times one fingerprint can be inserted: if the Cuckoo filter has buckets of size $b$, then elements with the same fingerprint can be inserted at most $2b$ times, since that fills both candidate buckets [2]. Disabling the removal operation can overcome this limitation, but removal is a sought-after feature in this niche.

To maintain an acceptable false positive rate, fingerprint size must scale with the size of the filter (or its number of buckets). Nevertheless, this is generally acceptable, as Cuckoo filters are more space efficient than the standard Bloom filter at low false positive rates [2].
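The insertion, lookup, and removal behaviour described in this section can be sketched as below. This is a simplified illustration, not the implementation from [2]: the class name, the 8-bit fingerprints, the XOR rule for deriving the second bucket from the first, the eviction limit, and the seeded random generator are all assumptions made for the example.

```python
import hashlib
import random


def _h(data: bytes) -> int:
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


class CuckooFilter:
    def __init__(self, num_buckets=64, bucket_size=4, max_kicks=500, seed=0):
        # Assumption: a power-of-two bucket count keeps the XOR step in range.
        assert num_buckets & (num_buckets - 1) == 0
        self.n = num_buckets
        self.b = bucket_size
        self.max_kicks = max_kicks
        self.buckets = [[] for _ in range(num_buckets)]
        self.rng = random.Random(seed)

    def _alt(self, index: int, fp: int) -> int:
        # Either bucket can be derived from the other plus the fingerprint.
        return (index ^ _h(fp.to_bytes(1, "big"))) % self.n

    def _candidates(self, item: str):
        fp = _h(b"fp:" + item.encode()) % 255 + 1  # 8-bit fingerprint in 1..255
        i1 = _h(item.encode()) % self.n
        return fp, i1, self._alt(i1, fp)

    def add(self, item: str) -> bool:
        fp, i1, i2 = self._candidates(item)
        for i in (i1, i2):
            if len(self.buckets[i]) < self.b:
                self.buckets[i].append(fp)
                return True
        # Both buckets are full: displace fingerprints until one finds room.
        i = self.rng.choice((i1, i2))
        for _ in range(self.max_kicks):
            slot = self.rng.randrange(self.b)
            fp, self.buckets[i][slot] = self.buckets[i][slot], fp
            i = self._alt(i, fp)
            if len(self.buckets[i]) < self.b:
                self.buckets[i].append(fp)
                return True
        # Filter is considered full. This sketch drops the fingerprint in hand;
        # a production filter would keep it so that no stored element is lost.
        return False

    def __contains__(self, item: str) -> bool:
        fp, i1, i2 = self._candidates(item)
        return fp in self.buckets[i1] or fp in self.buckets[i2]

    def remove(self, item: str) -> bool:
        fp, i1, i2 = self._candidates(item)
        for i in (i1, i2):
            if fp in self.buckets[i]:
                self.buckets[i].remove(fp)
                return True
        return False


cf = CuckooFilter()
for word in ("apple", "banana", "cherry"):
    cf.add(word)
cf.remove("banana")
print("apple" in cf)  # True: removal deletes one copy of one fingerprint only
```

The displacement loop is the source of the worst-case insertion cost discussed above, while lookups and removals only ever inspect two buckets.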

## Bloom Filters in Practice

We give an in-depth discussion of a new *de novo* genome assembler,
ABySS 2.0, which utilizes Bloom filters to sharply reduce memory
requirements without sacrificing speed or accuracy, and increases DNA
sequencing throughput. This and further advancements could revolutionize
preventative healthcare and affect the lives of many. We then list
several of the ways Bloom filters are used in everyday life to make
otherwise difficult procedures efficient.

### *De novo* genome assembly

Genomics research, the field of characterizing genomes to better
understand similarities and differences among species or even
individuals, has seen much development in the past decade, in part
thanks to growing and planned personalized medicine initiatives.
(“Personalized medicine” is a medical model which tailors medical
decisions, practices, interventions and/or product usage to the
individual patient’s predicted risks.) Because of this focus on the
individual, and given the large part genetics play in individual health,
the field depends on sequencing genomes *en masse*. The spread of
personalized medicine may be a major boon to preventative healthcare,
especially as sequencing becomes faster and more affordable. During this
time, the DNA sequence throughput of the industry’s best instruments has
constantly increased [1]. In particular, sequence assembly, which refers
to aligning and merging fragments read or copied from a longer DNA
sequence in order to reconstruct the original sequence, has seen drastic
reductions in both time and space requirements with the incorporation of
Bloom filters and related algorithms in assemblers.

DNA sequencing technology, biological or synthetic, cannot read whole
genomes (on the order of 7 billion basepairs split among 23 pairs of
chromosomes in humans) in one pass. Instead, genome sequencing works by
copying or listing the bases in short (20–30000 bases) “reads” and then
combining them. *De novo* sequence assembly, which constructs genomes
without a backbone or template, combines these reads using overlaps as
indicators of originally adjacent sequences. Figure 5 illustrates the
*de novo* assembly process.

The genome assembler ABySS is an example of the efficiency of Bloom filters
in *de novo* genome assembly: whereas ABySS 1.9 (which didn’t use a
Bloom filter) could assemble the human genome in 14 hours using a
whopping 418 GB of memory (across many machines), ABySS 2.0 can, using
the same parameters, assemble the human genome in 20 hours with a mere
34 GB [1]. (The absolute time for sequencing doesn’t matter: this is
about throughput. This is an order of magnitude increase in throughput
given the same resources, which matters for real-world applications.
Roughly 10 times the number of patients can have their DNA sequenced
using this sort of algorithm.) ABySS 2.0 achieves this performance
increase by consolidating its work onto a single machine (eliminating the
need for messaging) and by representing the De Bruijn graph with a Bloom
filter instead. In this context, a De Bruijn graph stores all possible
length-$k$ sequences (called “$k$-mers”, which are length-$k$ portions of
the longer read) made up of symbols (bases) from
$\Sigma = \{\mathrm{A}, \mathrm{C}, \mathrm{G}, \mathrm{T}\}$ in the vertices

$$V = \Sigma^{k},$$

and all four possible “successor” sequences (where the first base is removed, the rest shifted left, and another base is appended) are represented in the edges

$$E = \{(s_1 s_2 \cdots s_k,\ s_2 \cdots s_k b) : s_1 s_2 \cdots s_k \in V,\ b \in \Sigma\}.$$

As described in Figure 5, ABySS 2.0 represents the
vertices of the De Bruijn graph with a Bloom filter by setting bits
corresponding to the vertices ($k$-mers). These $k$-mers are the
length-$k$ subsequences of a short read, so in this way the Bloom filter
contains the entire read. With all reads stored in the Bloom filter,
ABySS 2.0 combines them to reconstruct the original sequence. The
algorithm repeatedly queries the Bloom filter to discover De Bruijn
graph edges (which lead to predecessor or successor $k$-mers). Since
$k$-mers are relatively short and are only part of the read, this
process may lead to the representation of false edges in the graph;
fortunately, using look-ahead mechanisms, these branches are trimmed if
they do not continue beyond a fixed look-ahead length. (This look-ahead
mechanism increases graph traversal cost, but eliminates the need for
additional data structures.) As reads extend and more likely represent a
correct path in the De Bruijn graph, they are branded “solid reads”
and considered to be part of the original sequence. Through this
process, the entire genome may be reconstructed from short reads with
high accuracy. (The accuracy is “on-par with other assemblers” [1], and
need not be perfect; even the biological process of DNA transcription
introduces error. Additionally, DNA has some protections against error
in that many codons encode the same meaning, and much of the genome has
little-to-no known effect.)
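The way a Bloom filter can stand in for the vertex set of a De Bruijn graph can be sketched as follows. This is a toy illustration and not ABySS 2.0’s code: the compact `BloomFilter` class, the helper names `kmers` and `successors`, the $k$-mer length of 3, and the two example reads are all assumptions, and the look-ahead trimming of false branches is omitted.

```python
import hashlib


class BloomFilter:
    def __init__(self, m: int = 1 << 16, num_hashes: int = 3) -> None:
        self.m = m
        self.num_hashes = num_hashes
        self.bits = bytearray(m // 8)

    def _indexes(self, item: str):
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        for i in self._indexes(item):
            self.bits[i // 8] |= 1 << (i % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[i // 8] & (1 << (i % 8)) for i in self._indexes(item))


def kmers(read: str, kmer_len: int):
    """Yield every length-kmer_len substring of a read, in order."""
    return (read[i:i + kmer_len] for i in range(len(read) - kmer_len + 1))


def successors(kmer: str, graph: BloomFilter) -> list:
    """Drop the first base, append each possible base, keep those in the filter."""
    candidates = (kmer[1:] + base for base in "ACGT")
    return [c for c in candidates if c in graph]


KMER_LEN = 3  # assumed, far shorter than real assemblers use
graph = BloomFilter()
for read in ("ACGTAC", "GTACCA"):
    for kmer in kmers(read, KMER_LEN):
        graph.add(kmer)

print(successors("ACG", graph))  # includes "CGT"; false positives could add more
```

No edges are stored explicitly: each edge is discovered by querying the filter for the four possible successors, which is why false positives show up as spurious branches that then need trimming.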

Table 2 shows how ABySS 2.0 compares to other
genome assembly algorithms in resource consumption. Notably, the tools
which efficiently represent the De Bruijn graph (ABySS 2.0, MEGAHIT,
Minia, and SGA) require much less memory than those that do not.
BCALM, while extremely resource light, sacrifices sequence contiguity
compared to the rest of the listed assemblers. (BCALM uses a novel
method of partitioning the De Bruijn graph to be resource efficient, and
does not use Bloom filters.) ABySS 2.0, using Bloom filters,
achieves a marked improvement over ABySS 1.9 and the rest of the pack in
efficiency, while achieving contiguity results on par with
DISCOVARdenovo and ABySS 1.9. Note that Minia also uses Bloom filters,
and ABySS 2.0 is largely based on Minia, with three novel features: (i)
the use of solid reads, (ii) a look-ahead mechanism to eliminate false
positives (as opposed to a separate data structure), and (iii) a new
hashing algorithm designed for DNA/RNA sequences. The authors of ABySS
believe that there is still great opportunity for improving throughput
without sacrificing contiguity; for example, the algorithmic ideas of
BCALM could be adapted to produce a more contiguous result [1]. (And
perhaps Bloom filters are not the absolute most efficient means of
representing a De Bruijn graph.)

Table 2: Resource consumption of genome assemblers.

Assembler | Memory (GB) | Time (h) |
---|---|---|
ABySS 1.9 | 418 | 14 |
ABySS 2.0 | 34 | 20 |
DISCOVARdenovo | 618 | 26 |
BCALM | 5 | 9 |
MEGAHIT | 197 | 26 |
Minia | 137 | 19 |
SGA | 82 | 65 |
SOAPdenovo | 659 | 35 |

### Networking

Bloom filters are used in device discovery: if two previously paired devices meet again under different circumstances, they can skip pairing again (which would be unnecessary, since they have previously done so). Each device builds a Bloom filter of the devices it has paired with and, when attempting to connect with another device, sends that filter to the second device. If the second device recognizes one of its own identifiers in the Bloom filter, it responds to the first device that they have paired previously, which initiates the mutual connection. This process, described and patented in [9], is used in Qualcomm devices (i.e., many or most cell phones and other portable devices) for ad-hoc network discovery (e.g., Bluetooth, WiFi direct, 802.xx wireless LAN).

### Making the world go ’round

Facebook [10] uses Bloom filters to represent the social graph for typeahead search, in order to display the friends and friends-of-friends that match the user’s query. The Bloom filter uses 16 bits per friend connection (or graph edge).

Yahoo! Mail [11] uses Bloom filters to represent a user’s email contact list, since the Bloom filter can fit in the browser cache. This obviates the need for round trips to Yahoo to verify that delivered emails are from contacts.

Tinder [12] [11] uses Bloom filters to record “right swipes” (accepting a user as a possible match) in order to remove users from the incoming list. When eventually the list is refreshed for new users to swipe on, those previously right swiped on will be filtered out. Some unseen users are filtered out by this process, but, as the saying goes, there are plenty of fish in the sea.

URL shorteners [11] employ Bloom filters to generate unique URLs: if a shortened URL has been previously used, it exists in the Bloom filter. Thus, by querying the Bloom filter with different shortened URLs until receiving a “false” response, the service can ensure unique URLs.

YouTube uses Bloom filters [11] to ensure recommended videos are not in the user’s watch history, in addition to the algorithms that optimize for relevance metrics, to feed users new and interesting content.

## Conclusion

The general Bloom filter is a powerful data structure thanks to its simplicity and its time and memory efficiency. Guaranteed constant time insert and query operations are rare among data structures, and often worth the trade-off of false positives in set membership problems. A more serious limitation for particular problems is the lack of a removal operation, which is overcome by various evolutions of the Bloom filter, and the Cuckoo filter. For many problems, one of these evolutions or alternatives may be better suited depending on the circumstantial resources and constraints.

Bloom filters, though simple and often hidden behind the scenes, have a profound and increasing effect on the modern world. This data structure makes efficient dating, entertainment, and networking (both social and digital) possible, and helps healthcare personalize treatment to the individual.

## References

[1] S. D. Jackman, “ABySS 2.0: Resource-efficient assembly of large
genomes using a bloom filter,” *Genome Research*, vol. 27, no. 5,
pp. 768–777, 2017.

[2] B. Fan, D. G. Andersen, M. Kaminsky, and M. D. Mitzenmacher,
“Cuckoo filter: Practically better than bloom,” *Proceedings of
the 10th ACM International Conference on emerging Networking
Experiments and Technologies (CoNEXT ’14)*, pp. 75–88, Dec 2014.

[4] S. Hess, “Issue 10896048: Transition safe browsing from bloom
filter to prefix set. - code review,” *Appspot.com*. Jan-2012.

[5] F. Bonomi, M. Mitzenmacher, R. Panigrahy, S. Singh, and G.
Varghese, “An improved construction for counting bloom filters,”
*Algorithms – ESA 2006*, Lecture Notes in Computer Science, pp.
684–695, 2006.

[6] A. Singh, S. Garg, K. Kaur, S. Batra, N. Kumar, and K.-K. R.
Choo, “Fuzzy-folded bloom filter-as-a-service for big data storage on
cloud,” *IEEE Transactions on Industrial Informatics*, pp. 1–1,
Jun 2018.

[7] F. Putze, P. Sanders, and J. Singler, “Cache-, hash-, and
space-efficient bloom filters,” *Journal of Experimental
Algorithmics*, vol. 14, p. 4.4, Dec 2009.

[8] M. A. Bender, “Don’t thrash,” *Proceedings of the VLDB
Endowment*, vol. 5, no. 11, pp. 1627–1637, Jul 2012.
