RAMBO: Repeated And Merged Bloom Filter for Multiple Set Membership Testing (MSMT) in Sub-linear time

10/07/2019 ∙ by Gaurav Gupta, et al. ∙ Rice University and Amazon

Approximate set membership is a common problem with wide applications in databases, networking, and search. Given a set S and a query q, the task is to determine whether q ∈ S. The Bloom Filter (BF) is a popular data structure for approximate membership testing due to its simplicity. In particular, a BF consists of a bit array that can be incrementally updated. A related problem concerning this paper is the Multiple Set Membership Testing (MSMT) problem. Here we are given K different sets, and for any given query q the goal is to find all of the sets containing the query element. Trivially, a multiple set membership instance can be reduced to K membership testing instances, each with the same q, leading to O(K) query time. A simple array of Bloom Filters can achieve that. In this paper, we show the first non-trivial data structure for streaming keys, RAMBO (Repeated And Merged Bloom Filter), that achieves expected O(√K log K) query time with an additional worst-case memory cost factor of O(log K) over the array of Bloom Filters. The proposed data structure is simply a count-min sketch arrangement of Bloom Filters and retains all of their favorable properties. We replace the addition operation with a set union and the minimum operation with a set intersection during estimation.


1 Introduction

Approximate set membership is a fundamental problem that arises in many high-speed memory-constrained applications in databases, networking, and search. The Bloom Filter [bloom1970space, mitzenmacher2002compressed, cohen2003spectral] is one of the most famous and widely adopted space-efficient data structures for approximate set membership. It allows constant-time membership testing in merely O(n) bits of space, where n is the cardinality of the set under consideration. Bloom Filters trade a small false-positive probability for an impressive query time and memory trade-off. Compared to other sophisticated hashing algorithms [networkapplications, simpleMPH], the simplicity of Bloom Filters and the ability to cheaply insert elements on the fly make them quite successful in many latency and memory-constrained applications.

In this work, we are interested in a related multiple set membership testing (MSMT) problem. Here, instead of one set S, we are given K different sets S_1, S_2, …, S_K, where each S_i ⊆ U and U is the universe of the keys. Given a query q ∈ U, our goal is to identify all the sets that contain q. In particular, we want to return the index set A ⊆ {1, 2, …, K} such that i ∈ A if and only if q ∈ S_i.

A recent Nature paper, "BItsliced Genomic Signature Index (BIGSI)" [bigSI], motivates the MSMT problem. Here, the authors indexed the entire global corpus of 447,833 bacterial and viral whole-genome sequence datasets using arrays of Bloom Filters. In this application, the authors treated every bacterial genome sequence as the set of all k-mers (contiguous k-character strings) in the sequence. There are 447,833 such sets, which collectively occupy 170 terabytes of space. Given a query k-mer q, identifying all the genome sequences that contain q is an important prerequisite for numerous computational biology pipelines. For instance, BIGSI applied this functionality to rapidly finding the resistance genes MCR-1, MCR-2 and MCR-3, determining the host range of 2,827 plasmids, and quantifying antibiotic resistance in archived datasets. This is precisely an MSMT problem. This work also stressed a few important properties of Bloom Filters that made MSMT at scale possible. The ability to perform streaming updates, the manipulation of simple bit arrays, and bitwise operations to obtain the union and intersection were critical for scaling the system. BIGSI represents the largest scale (170 terabytes) achieved for MSMT in practice. The work mentions larger datasets and the need to scale BIGSI to handle millions of datasets, leading to hundreds of millions to billions of sets.

The authors use one Bloom Filter to compress each set (or bacterial genome sequence) independently. Essentially, the MSMT instance is reduced to K classical membership testing instances with an expensive query complexity of O(K). Unfortunately, the query complexity O(K), when K grows to hundreds of millions, is prohibitively expensive. We will call this approach the Array of Bloom Filters (ABF).

Motivated by the BIGSI scale, our paper is concerned with streaming data structures only, where the data structure can be updated incrementally, analogous to Bloom Filters. In the streaming setting, we observe a set of inputs one at a time. Upon observing an element x, we must perform an online update to the data structure. We are not allowed to remember x after the update is complete, and we are not allowed a second pass through the dataset. The streaming setting is also applicable to the type of web-scale problems that arise in industry-scale product recommendation.

Additionally, we would like to retain the benefits of manipulating bit arrays. Keeping in mind these constraints, the trivial solution of K independent Bloom Filters seems to be the best. This is likely the reason for its use in practice and its recent application in BIGSI.

The main focus of this paper is to reduce the query time of MSMT while still maintaining the desirable properties associated with Bloom Filters. In particular, we are looking for a method with an efficient query-time algorithm that also has the following three properties: 1. Low false positive rate, 2. Cheap updates for streaming inputs, and 3. A simple, systems-friendly data structure that is straightforward to parallelize. Existing solutions with fast query-time algorithms are either space inefficient or require complex structures such as minimal perfect hash constructions that are neither simple to implement nor streamable. In this work, we propose RAMBO, a data structure with sub-linear query time that has these properties.

1.1 Our Contribution

We show the first non-trivial data structure, RAMBO (Repeated And Merged Bloom Filter), for the MSMT problem. RAMBO achieves sub-linear query time and at the same time satisfies the following three needed properties, analogous to arrays of Bloom Filters:

  • The data structure only consists of bit arrays which can be updated incrementally on a stream analogous to Bloom Filters.

  • The query operation only requires hashing and bitwise AND/OR operations.

  • The proposed architecture has zero false negatives.

Our proposed idea is a count-min sketch (CMS) [CMS, MACH] where each cell represents a Bloom Filter. During collisions, instead of addition, we perform the union of Bloom Filters. During inference, we take the set intersection instead of the minimum used by the CMS. The proposed data structure is an ideal illustration of the power of the count-min sketch beyond frequency estimation, demonstrating how it can compress non-trivial objects. The idea presented could be of independent interest in itself. In a nutshell, our main result is:

Theorem 1.

For K sets in a multiple set membership problem, where each key is a member of at most a constant number of sets, our data structure solves the approximate multiple set membership testing problem using O(√K log K) expected query time and at most O(log K) extra bits per key compared to the Array of Bloom Filters.

2 Notations

Symbol | Notation
K | Number of sets
N | Total number of insertions in the standard Bloom Filter array
x | Key
S | Set
U | Universe of the keys
p | False positive rate of a Bloom Filter
B | Size (number of cells) of a table in RAMBO
R | Number of tables in RAMBO
V | Max number of sets a key belongs to
δ | Failure rate of RAMBO
Table 1: Notations used in our paper.

3 Preliminaries

Here, we introduce preliminary concepts that will be needed in our formal analysis of RAMBO.

Universal Hashing

A 2-universal hash function family is defined as

Definition 1.

A family of functions H = {h : U → {1, …, B}} is called a 2-universal hash family if, for any two distinct keys x, y ∈ U, Pr_{h∈H}[h(x) = h(y)] ≤ 1/B.

The following fact is of particular importance for our analysis: if we construct a hash table T with range size B using a 2-universal hash function, then every bucket of T has an expected size of n/B, where n is the number of insertions into the hash table. Most practical hash functions, like murmurhash [horvath], are approximately 2-universal.
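For concreteness, the following is a minimal Python sketch (not from the paper) of such a family, using the standard construction h(x) = ((a·x + b) mod P) mod B with random a, b and a large prime P; all names here are illustrative.

import random

# Minimal sketch of a 2-universal hash family: h(x) = ((a*x + b) mod P) mod B.
# P is a prime larger than the key universe; a and b are drawn at random.
# This gives Pr[h(x) = h(y)] close to 1/B for any fixed pair x != y.
P = (1 << 61) - 1  # a Mersenne prime, comfortably larger than 64-bit keys

def make_2universal_hash(B, seed=None):
    rng = random.Random(seed)
    a = rng.randrange(1, P)
    b = rng.randrange(0, P)
    return lambda x: ((a * x + b) % P) % B

h = make_2universal_hash(B=100, seed=42)
print(h(12345), h(67890))  # two bucket indices in [0, 100)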

Bloom Filters

The Bloom Filter is an array of m bits which represents a set S of n elements. It is initialized with all bits set to 0. During construction, it uses η hash functions h_1, h_2, …, h_η, each with range {1, …, m}, to set the bits at locations h_1(x), h_2(x), …, h_η(x) for every key x ∈ S. In this process, a bit in the Bloom Filter may be set to 1 multiple times, which does not effectively change the state of the Bloom Filter. Once the construction is done, the Bloom Filter can be used to check the membership of a query q by calculating the AND of the bits at the locations h_1(q), h_2(q), …, h_η(q). The output is 1 if all these locations are 1, and it is 0 if at least one of them is 0. Given this structure, it is straightforward to prove that there are no false negatives, as every key x ∈ S sets all the bits at locations h_1(x), …, h_η(x). However, false positives are introduced because the bits at the hash locations of q may have been set by other keys from the set S. The false positive rate of the Bloom Filter is given by:

p = (1 − (1 − 1/m)^{ηn})^η ≈ (1 − e^{−ηn/m})^η.

To minimize the false positive rate, the number of hash functions (η) and the size m of the Bloom Filter should be:

η = (m/n) ln 2 and m = −n ln p / (ln 2)² ≈ 1.44 n log₂(1/p),

where p is the false positive rate. The Bloom Filter grows in size linearly with the cardinality of the set it represents. It is a constant factor of 1.44 away from the information-theoretic lower bound of n log₂(1/p) bits on size. Although this is sub-optimal, the Bloom Filter enjoys widespread practical use due to its simplicity and systems-friendly properties.
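The following is a minimal Python sketch of a Bloom Filter sized by these formulas. It is illustrative rather than the paper's implementation; it uses the third-party mmh3 murmurhash bindings (any approximately 2-universal hash would do).

import math
import mmh3  # murmurhash bindings: pip install mmh3

class BloomFilter:
    """Bloom Filter sized for n expected keys and false positive rate p,
    using m = -n*ln(p)/(ln 2)^2 bits and eta = (m/n)*ln 2 hash functions."""

    def __init__(self, n, p):
        self.m = max(1, math.ceil(-n * math.log(p) / (math.log(2) ** 2)))
        self.eta = max(1, round((self.m / n) * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _locations(self, key):
        return [mmh3.hash(key, seed) % self.m for seed in range(self.eta)]

    def insert(self, key):
        for loc in self._locations(key):
            self.bits[loc // 8] |= 1 << (loc % 8)

    def query(self, key):
        # No false negatives: an inserted key always sets all its locations.
        return all(self.bits[loc // 8] >> (loc % 8) & 1
                   for loc in self._locations(key))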

Count Min Sketch

Our proposed architecture takes inspiration from the count-min sketch (CMS) [CMS], which is a probabilistic data structure first introduced to identify frequent elements, or heavy hitters, in a data stream. The CMS is a B × R array of integer counts that are incremented online in a randomized fashion. Each input to the CMS is hashed, using R universal hash functions, to one of the B rows in each of the R columns of the sketch. To perform a streaming update, we find the R hash values of an element x and increment the counts at those locations (or cells). At query time, we want to determine the unknown frequency count of an element q. We find the R cells in the array that q maps to under the hash functions. This results in a set of R counts. The estimated frequency count of q is the minimum over this set of count values.

To understand how the CMS works, let us suppose we have a heavy-hitting element x and consider the situation where x has a hash collision with another element y in one of the columns. The final count value for x in this column is the sum of the counts of x and y. If x and y are both heavy hitters, this is a bad overestimate. Otherwise, the count value is approximately correct. The key observation is that we are unlikely to report dramatic overestimates because we take the minimum over all columns. Intuitively, we are unlikely to have "problematic" hash collisions in every column if we assume that only a few elements are heavy hitters. This condition is common in the sketching literature [charikar2002finding, liu2016one, braverman2016beating] and is sometimes referred to as a power-law or sparsity assumption. Under this assumption, we have strong bounds on the error of our recovered estimates [CMS].
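As a reference point for the construction in section 7, here is a minimal count-min sketch in the same B × R orientation used above (a hedged sketch, not the paper's code; mmh3 seeds stand in for the R universal hash functions):

import mmh3

class CountMinSketch:
    """R columns of B counters; the estimate is the minimum over columns."""

    def __init__(self, B, R):
        self.B, self.R = B, R
        self.counts = [[0] * B for _ in range(R)]

    def update(self, key, delta=1):
        for r in range(self.R):
            self.counts[r][mmh3.hash(key, r) % self.B] += delta

    def estimate(self, key):
        # Counts only overestimate; the min discards most collision noise.
        return min(self.counts[r][mmh3.hash(key, r) % self.B]
                   for r in range(self.R))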

4 Formal Problem Statement

We formally introduce the multiple set membership problem. Our problem is fundamentally the same as the one introduced by [mitzenmacher2016omass], but we have a slightly different notion of the failure probability. Our formulation of the problem is motivated by recent problems in web-scale search and DNA analysis for which our definition is more appropriate.

Definition 2.

Approximate Multiple Set Membership Testing (MSMT)
Given a set of sets , construct a data structure which, given any query , reports the correct membership status of for every with probability .

Here, we define the failure probability δ as the probability of reporting an incorrect membership status for even one of the K sets. This is in contrast to [mitzenmacher2016omass], which interprets the failure rate on a per-set basis. Otherwise, the two problems are equivalent. It should also be noted that, in the context of Bloom Filters and RAMBO, the failure probability is equivalent to the false positive rate.

5 Known Approaches

Array of Bloom Filters (ABF)

As argued before, one straightforward approach to the multiple set membership problem is to reduce the problem to K instances of the classical membership testing problem. For this, we store an array of K different set representations [bigSI], one for each set. Many data structures can be used to encode the representations, as there are a wide variety of approaches to the approximate set membership problem. For simplicity and concreteness, we consider arrays of Bloom Filters, since they are theoretically well-studied and overwhelmingly common in practice.

The analysis for an Array of Bloom Filters (ABF) is straightforward. Let filter BF_i be associated with set S_i and note that it requires 1.44 |S_i| log₂(1/p) bits. Assuming that each filter has the same false positive rate p, the ABF requires 1.44 N log₂(1/p) bits in total, where N = Σ_i |S_i|. The per-set false positive rate is p, and the query cost is linear in the number of sets, i.e., O(K).
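For reference, a minimal sketch of the ABF baseline (reusing the illustrative BloomFilter class from section 3; the known per-set capacities are an assumption of this sketch):

class ArrayOfBloomFilters:
    """One Bloom Filter per set; queries scan all K filters, hence O(K)."""

    def __init__(self, set_sizes, p):
        self.filters = [BloomFilter(n, p) for n in set_sizes]

    def insert(self, key, set_id):
        self.filters[set_id].insert(key)

    def query(self, key):
        # Linear scan: every one of the K filters is probed.
        return [i for i, f in enumerate(self.filters) if f.query(key)]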

Method | Query Time | Worst-case Memory | Comments
Inverted Index | O(1) | Size of all keys in U plus N log₂ K bits of posting lists | Intractable memory, no streaming, requires minimal perfect hash
Bloom Filter Array | O(K) | 1.44 N log₂(1/p) bits | Widely used in practice, state of the art
RAMBO | O(√K log K) | 1.44 N log₂(1/p) × O(log K) bits | Our proposal, uses less memory in expectation
Table 2: Summary of related work on the MSMT problem. Here, p is the per-cell false positive rate.

Inverted Index

The inverted index [witten2016data, baeza1999modern, witten1999managing] is an exact database query solution which stores a posting list for each key in the database. In our case, the posting list is the list of set IDs where the given key is present. The total memory utilization of all the posting lists is N·c bits, where c is the bit precision of the set IDs. However, this does not guarantee constant query time by itself; the implementation of the inverted index determines the query time. To have O(1) queries, the posting lists must be indexed by a constant-time lookup key-value store. One way to implement this data structure is with a hash table with no collisions, essentially using a minimal perfect hashing scheme. Using the best-known method, this only adds an additional 2.07 bits per key. However, all existing minimal perfect hash functions require prior knowledge of all keys and therefore cannot be constructed in the streaming setting. In the context of MSMT, the size of the inverted index is N·c bits plus the space for the keys themselves, which is far greater than a simple Bloom Filter array in practice: for a large number of sets, the bit precision c = log₂ K exceeds the roughly 1.44 log₂(1/p) bits per key of a Bloom Filter.

6 Intuition Behind our Approach

We draw inspiration from the popular count-min sketch (CMS) data structure for streaming data and the literature on sparse recovery. Assume that we reduce the K sets into B < K meta-sets by randomly combining K/B sets each and taking their union. Now, given a query q, we only need to perform set membership testing on a small number (B) of meta-sets. If a merged set contains q, then we know that q is a member of one of the K/B sets we merged. In expectation, this reduces the number of candidates from K to K/B sets. If we independently repeat the process and request another set membership query over a different table of merged sets, then in expectation we have reduced the potential confusion space to K/B² using only two small set membership queries. This goes down exponentially with more repetitions, as the worked example below illustrates.
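For instance, with illustrative numbers (chosen here only for the arithmetic, not taken from the paper):

K = 10^6,\; B = 10^3: \qquad K/B = 10^3 \text{ candidates after one table}, \qquad K/B^2 = 1 \text{ after two}, \qquad K/B^j \text{ after } j \text{ tables (in expectation).}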

We propose a twist to the CMS: we replace the counters in the CMS with Bloom Filters. During sketch construction, instead of adding colliding elements, we merge the sets that collide in a cell by taking their union. We refer to our merged Bloom Filter cell as a Bloom Filter Union (BFU), since each cell represents the merged union of sets.

To perform membership testing on this structure, we query each of the BFU cells. If an element q is present in a particular set, the BFUs corresponding to that set will all be active. If not, some of the BFUs may be active, but there is a strong likelihood that at least one will be inactive. To solve the MSMT problem, we only report sets for which every BFU was active. The intuition is that "unlucky" hash collisions and false positives occur relatively infrequently. The recovery operation is, therefore, an intersection over the active BFUs.

7 RAMBO: Repeated And Merged Bloom Filters

The proposed architecture is an array of R tables, each containing B cells. The tables in RAMBO are independent of each other, as we use an independent hash function for each of them. The K sets are each hashed into one cell in each table (i.e., hashed R times). Note that due to the universality of the hash functions, every cell of a table in RAMBO contains an expected K/B random sets out of the K sets.

Figure 1: The figure shows the RAMBO architecture. The construction of Table 1 is highlighted. Here the sets are randomly merged (via a 2-universal hash function) into B cells. Each cell is a Bloom Filter of the union of the merged sets. The architecture has R such tables containing B buckets each.

Now each cell is a joint representation of a group of sets, rather than an individual set representation as in the Array of Bloom Filters architecture. To resolve the individual identity of a set, we use R repetitions of the table, each with an independent random grouping (details in section 7.1). Just like in the CMS, the multiple tables help in singling out a common set (or sets) containing the query. Section 7.2 explains the set resolution in detail.

7.1 Creation and Insertion

RAMBO has a total of R tables with B cells each. We use a 2-universal hash function (section 3), analogous to the CMS, for creating the cells. Unlike the CMS, which combines elements colliding in the same cell by addition, we instead keep a Bloom Filter for the union of the sets colliding in the same cell. The size of this Bloom Filter Union (BFU) depends on the number of unique keys in the cell.

The union is over the K/B sets (in expectation) that land in a cell. The cardinality of the union will be discussed in section 8.1. Our RAMBO data structure is merely a CMS (of size B × R) of Bloom Filters. Clearly, this structure is conducive to updates with a data stream: every new key is hashed into the R tables using its set ID, as sketched below. We can also avoid fixed-size Bloom Filters by replacing them with the scalable Bloom Filter [scalableBF] architecture.
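A minimal Python sketch of construction and streaming insertion (illustrative, not the paper's code; it reuses the BloomFilter class from section 3, and the fixed per-BFU capacity and false positive rate are assumptions of this sketch):

import mmh3

class RAMBO:
    """An R x B grid of Bloom Filter Unions (BFUs). One hash per table
    assigns each set ID to a cell, so each table partitions the K sets."""

    def __init__(self, K, B, R, bfu_capacity, p=0.01):
        self.K, self.B, self.R = K, B, R
        # groups[r][b] records which set IDs merged into cell (r, b);
        # it is kept for clarity and can be recomputed from the hashes.
        self.groups = [[set() for _ in range(B)] for _ in range(R)]
        # Fixed-capacity BFUs; a scalable Bloom Filter would avoid sizing.
        self.bfus = [[BloomFilter(bfu_capacity, p) for _ in range(B)]
                     for _ in range(R)]

    def _cell(self, set_id, r):
        return mmh3.hash(str(set_id), r) % self.B

    def insert(self, key, set_id):
        # Streaming update: the key goes into exactly one BFU per table.
        for r in range(self.R):
            b = self._cell(set_id, r)
            self.groups[r][b].add(set_id)
            self.bfus[r][b].insert(key)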

7.2 Notion of Separability

Since each cell represents more than one set, we intuitively need multiple repetitions to distinguish between two given sets. We formalize this notion as set separability - that is, the ability of RAMBO to determine which set contains the query from the BFUs.

Definition 3.

Separability
We say that two sets and , where are separable if for any , the RAMBO architecture returns the output and with non-zero probability.

We will see in section 8.2 that separability and the false positive rate depend on the BFU false positive rate p and the RAMBO parameters B and R.

7.3 MSMT Queries

A query q can occur in multiple sets due to its multiplicity. The RAMBO architecture, shown in figure 2, returns at least one cell per table at query time. Due to the possibility of false positives, it can return multiple cells per table. This is a slight departure from the standard CMS. Our query process, therefore, is to first take the union of the set IDs in the returned cells from each table, followed by the intersection of those unions across tables. The union and intersection can be accomplished using fast bitwise operations. Algorithm 1 summarizes the flow, and the extra query time incurred during this process is analyzed in section 8.3. The resulting index set A (as the problem definition suggests) is the final output: the membership vector of the given query.

Figure 2: For a given query q, each table of RAMBO returns one or more BFUs (represented by dots) where membership holds. The membership candidates for each table are given by the union of the sets in the returned BFUs. This is followed by the intersection of the returned candidates across tables, which defines the final membership of the query q.
  Input: query q
  Architecture: RAMBO (size: B × R)
  Result: A, where q ∈ S_i for all i ∈ A
  for j = 1 to R do
     for b = 1 to B do
        M_b ← testMembership(q, BFU_{b,j}) {M_b has the returned sets}
     end for
     A_j ← getUnion(M_1, …, M_B) {A_j has the union of returned sets}
  end for
  A = getIntersection(A_1, …, A_R) {final returned set IDs}
Algorithm 1: Algorithm for membership testing using the RAMBO architecture
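A matching sketch of Algorithm 1 for the illustrative RAMBO class above (hedged: rambo_query and the usage values are hypothetical, not from the paper):

def rambo_query(rambo, key):
    """Per table, union the set IDs behind every active BFU (getUnion);
    then intersect the per-table unions (getIntersection)."""
    result = None
    for r in range(rambo.R):
        active = set()
        for b in range(rambo.B):
            if rambo.bfus[r][b].query(key):   # testMembership
                active |= rambo.groups[r][b]  # union within table r
        result = active if result is None else result & active
    return result

rambo = RAMBO(K=1000, B=32, R=10, bfu_capacity=10_000)
rambo.insert("ACGTT", set_id=7)
print(rambo_query(rambo, "ACGTT"))  # always contains 7: no false negatives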

8 Analysis

Since RAMBO is a matrix (B × R) of Bloom Filters, it enjoys all the properties of arrays of Bloom Filters, i.e., streaming updates and bitwise operations. RAMBO has two important parameters, B and R, which control the memory and query time and also affect the false positive rate. In the remaining part of the paper, we show that there is a choice of R and B that gives sub-linear query time.

8.1 Memory analysis

Lower Bound

At first glance, it may seem possible that RAMBO can represent the set membership information using less than O(N) space. For instance, if the hash function merges S_i and S_j, then it seems that we save space by storing only one copy of the keys in S_i ∩ S_j. Unfortunately, this intuition is wrong. Any data structure which solves the set-of-sets problem must solve an equivalent set membership problem that has an information-theoretic lower bound of N log₂(1/p) bits [carter1978exact].

Lemma 1.

Multiple set membership memory bound
Any data structure which solves the MSMT problem requires at least

N log₂(1/p) bits,

where p is the false positive rate of the membership testing from this data structure.

Proof.

Let E be the set of all pairs (x, S_i) for which x ∈ S_i, with 1 ≤ i ≤ K. Then |E| = N. Any data structure which solves the set-of-sets membership problem must identify whether a given pair is in E. Therefore, the data structure solves the approximate set membership problem on N elements and requires N log₂(1/p) bits by the information-theoretic lower bound [carter1978exact]. ∎

Worst Case Memory Bound

To obtain a worst-case memory bound for RAMBO, we observe that the memory used by one repetition (one table) is upper bounded by the memory of the Array of Bloom Filters. This easily follows from the fact that each of the N pairs from Lemma 1 is inserted into exactly one of the BFUs per table. Using this fact, we arrive at the following worst-case memory bound.

Lemma 2.

Worst case memory bound
The maximum memory needed by a RAMBO data structure in the worst case is bounded by

1.44 N log₂(1/p) × O(log(K/δ)) bits,

where the O(log(K/δ)) factor comes from the fact that the number of repetitions R is of the order O(log(K/δ)) (refer to theorem 2) and K is the number of sets in our problem. This result means that RAMBO will use at most a factor of O(log(K/δ)) extra bits per element when compared to the ABF method.

Expected Memory

In practice, the actual memory usage of RAMBO lies between the best and worst-case memory bounds. Here, we provide an average-case analysis under a simplifying assumption to analyze the expected performance of our method. To make the analysis simpler, we assume that every key has a fixed number of duplicates V. This means every item is present in exactly V sets. We define the memory of RAMBO as the total number of bits used by all of its BFUs.

Lemma 3.

For the proposed RAMBO architecture of size B × R, and data where every key has V duplicates, the expected memory requirement is

E[M(RAMBO)] = 1.44 log₂(1/p) · N · R · γ,

where γ = (B/V)(1 − (1 − 1/B)^V).

The expectation is defined over the random variable X, which takes values in {1, 2, …, V} (refer to appendix A). The factor γ is at most 1. To give an idea of how γ varies, figure 3 plots the function with respect to B.

Figure 3: Plot of γ as a function of B for fixed V. The factor is less than 1 when there are duplicates of the keys (V > 1). This factor gives an idea of the memory saving for one table.

We will prove the expressions for E[M(RAMBO)] and γ, for any B and V, in appendix section A.
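As a quick numeric companion to Lemma 3 (under the same fixed-duplicates assumption, and using the reconstruction of γ above), one can tabulate the factor; all names here are illustrative:

import math

def gamma(B, V):
    # gamma = (B/V)*(1 - (1 - 1/B)^V): equals 1/V at B = 1 and tends to 1
    # as B grows, so merging saves memory whenever keys have duplicates.
    return (B / V) * (1.0 - (1.0 - 1.0 / B) ** V)

def expected_rambo_bits(N, R, p, B, V):
    return 1.44 * math.log2(1.0 / p) * N * R * gamma(B, V)

for B in (1, 10, 100, 1000):
    print(B, round(gamma(B, V=10), 4))  # rises from 0.1 toward 1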

8.2 False Positives

In this section, we express the total false positive rate of RAMBO in terms of B, R, K, and the false positive rate p of each BFU. Our goal in this section is to prove that we achieve a given failure rate δ using only R = O(log(K/δ)) repetitions. We begin by finding the failure probability for one set and extend this result to all K sets.

Our first claim is that RAMBO cannot report false negatives. This follows trivially from the hash-based merging procedure and the fact that each BFU cannot produce false negatives. If an element q ∈ S_i, then each BFU containing S_i reports membership, and therefore we must report q ∈ S_i according to Algorithm 1.

The per-set false positive analysis is more involved. Suppose we have an element q ∉ S_i. For RAMBO to report that q ∈ S_i, we require the BFU containing S_i to report membership in every column. There are two ways for this to occur. The first is for S_i to randomly collide, under the hash mapping, with a set that does contain q. Alternatively, the BFU containing S_i can have a false positive due to the underlying Bloom Filter structure. To illustrate, consider the example in Figure 4. Set S_i maps to the BFUs marked by dots, and BFUs containing true positive sets are shaded. If the BFUs in columns 1, 3, 4, and 6 were all to have false positives, then we would incorrectly report that q ∈ S_i. By formalizing this argument, we obtain a lemma that relates the RAMBO parameters to the probability of incorrectly reporting whether q ∈ S_i.

Figure 4: RAMBO false positive rate. Suppose a set S_i maps to the cross-marked BFUs under the hash function. To have a false positive for a query q, these BFUs must either return a false positive or collide with a set that contains q (shaded).
Lemma 4.

Individual false positive rate
Given the RAMBO data structure with BFUs having a false positive rate of p and a query q, we assume that q belongs to no more than V sets. Under these assumptions, the probability of incorrectly reporting that q ∈ S_i when q ∉ S_i is at most

(p + V/B)^R,

where p is the individual false positive rate of the BFUs.

The proof is a straightforward extension of the intuition presented earlier. In each column, false positives can be introduced by BFUs that violate the separability condition (a collision with one of the at most V sets containing q, with probability at most V/B) or by BFUs that happen to have a false positive (probability p); the R columns are independent. Using this lemma, we can construct the overall failure probability for the multiple set membership problem. Recall that we define failure as being incorrect about even one of the K sets. By the probability union bound, we can set the individual false positive rate such that we have a bounded overall failure rate.

Lemma 5.

RAMBO failure rate
Given a RAMBO data structure with BFUs having a false positive rate of p and a query q, we assume that q belongs to no more than V sets. Under this assumption, the probability δ of reporting an incorrect membership status for any of the K sets is upper bounded by

δ ≤ K (p + V/B)^R,

where p is the individual false positive rate of the BFUs.

A direct consequence of lemma 5 is that we need only a sub-linear number of RAMBO repetitions to obtain any overall failure probability δ.

Theorem 2.

Given a failure rate δ, we need

R = log(K/δ) / log(1/(p + V/B)) = O(log(K/δ)) repetitions.

Proof.

From lemma 5, we can state that

δ ≤ K (p + V/B)^R,

where V is the maximum number of sets a key belongs to and p is the false positive rate of each BFU. Observe that if p + V/B ≤ 1/2, then R = log₂(K/δ) suffices. For constant V and small p, this condition approximately requires that B ≥ 2V. Later, we will take B = √K in theorem 3; therefore, for sufficiently large K and bounded p it suffices to take R = O(log(K/δ)). ∎
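A back-of-the-envelope helper for this bound (a sketch; the function name and example parameters are illustrative assumptions, not values from the paper):

import math

def repetitions_needed(K, delta, p, B, V):
    # delta <= K*(p + V/B)^R  =>  R >= log(K/delta) / log(1/(p + V/B))
    per_table = p + V / B
    assert per_table < 1, "need p + V/B < 1 for the bound to be useful"
    return math.ceil(math.log(K / delta) / math.log(1.0 / per_table))

K = 1_000_000
print(repetitions_needed(K, delta=0.01, p=0.01, B=int(math.sqrt(K)), V=5))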

8.3 Query time analysis

The objective of this section is to show that RAMBO achieves expected sub-linear query time. To query the set membership status of an element q, we first need to compute B × R Bloom Filter lookups. After we have looked up q in each BFU, we find the union within each column and the intersection across columns of the sets that hashed to the "active" (returned true) BFUs. We report all sets which survive the iterative union and intersection procedure, as these are the sets with an active path through RAMBO. To find the query time, we need to know the cost of performing these unions and intersections.

Since each column partitions the K sets, the union operations do not require any computational overhead. The set intersections between columns, however, require O(|A_j| + |A_{j+1}|) operations, where A_j is the set of all active sets in the present column and A_{j+1} is the set of all active sets in the next column. Since there are R columns, the total cost for the intersections is O(R · E[|A|]). By observing that E[|A|] ≤ V + K(p + V/B), we obtain the following lemma.

Lemma 6.

Expected query time
Given the RAMBO data structure with B × R BFUs and a query q that is present in at most V sets, the expected query time is

O(BR + R(V + K(p + V/B))).

Here, K is the number of sets and p is the false positive rate of each BFU.

By taking B = √K and R = O(log(K/δ)) according to Theorem 2, we get an expression for the query time in terms of the overall failure probability δ and the total number of sets K. Our main theorem is a simplified version of this result where we omit the lower-order terms.

Theorem 3.

MSMT query time
Given a RAMBO data structure with B = √K and R = O(log(K/δ)), and a query q that is present in at most V sets, RAMBO solves the MSMT problem with probability 1 − δ in expected query time

O(√K log(K/δ)).

Here, K is the number of sets and p is the false positive rate of each BFU, which is independent of K and δ.
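To see where these parameter choices land numerically, here is an illustrative cost model following the reconstructions of Lemma 6 and Theorem 2 above (constants dropped; the function name and every number are assumptions for the example, not the paper's):

import math

def query_cost_model(K, p, V, delta):
    B = int(math.sqrt(K))                                  # Theorem 3 choice
    R = math.ceil(math.log(K / delta) / math.log(1 / (p + V / B)))
    lookups = B * R                                        # Bloom lookups
    intersections = R * (V + K * (p + V / B))              # E[|A|] per column
    return B, R, lookups + intersections

print(query_cost_model(K=1_000_000, p=0.001, V=5, delta=0.01))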

9 Conclusion

We propose a simple yet elegant data structure, RAMBO, for the Multiple Set Membership Testing problem. Many practical problems, like search in resource-constrained settings and sketch-based data structures for fast sampling, can be solved using RAMBO. We have proved that the query time is sub-linear (O(√K log K)) in the number of sets K. We have also seen that the memory requirement of RAMBO is Rγ times the size of the Array of Bloom Filters in expectation, where γ ≤ 1. Given that there are duplicates of keys (which occurs in practice), the additional memory factor is minimal. Moreover, the proposed data structure can be easily parallelised and supports insertions of streaming data keys. In the future, it will be interesting to analyse RAMBO in the case of a different multiplicity V_x for every key x.

References

Appendix A

The proof of lemma 3 is as follows.

Let N = Σ_{i=1}^{K} |S_i| be the total number of insertions, where each S_i is a set; since every key has V duplicates, there are N/V distinct keys. Now if we divide these K sets into B groups (by using a universal hashing method to hash sets into B bins), we will get B unions of random groups of sets, where every key in a bin has a varying number of duplicates j ∈ {0, 1, …, V}. Zero duplicates means the element does not exist in the bin, and one duplicate means the element has only one copy. This implies that the number of duplicates of a key in a bucket is a random variable X with some distribution. We are going to find that distribution.

The (unique-key) size of a bucket is given by:

N₁ = Σ_{x} 1{X_x ≥ 1},

where the sum runs over the N/V distinct keys, N₁ is the number of insertions in the bucket, and each X_x is a random variable with distribution Pr[X = j] = P_j. By linearity of expectation, we can say that E[N₁] = (N/V) Pr[X ≥ 1].

We can see γ as a multiple-count reduction factor, which works as a normalizer for the multiplicity of the keys. After randomly dividing the K sets into B cells, we will analyze the probability P_j of having j duplicates of a key in a bucket of a hash table of size B.
An element can have at most V duplicates in the bucket, or it can have no presence in the bucket. This problem is similar to putting V balls into B bins. The probability of having all V balls in one given bucket is (1/B)^V.
The probability of having V − 1 balls in one given bucket and the remaining one in any other bin is V (1/B)^{V−1} (1 − 1/B).
Hence the probability of getting j balls in one given bucket and V − j in the remaining others is:

P_j = C(V, j) (1/B)^j (1 − 1/B)^{V−j}.

From the expression of P_0 we have Pr[X ≥ 1] = 1 − P_0 = 1 − (1 − 1/B)^V, and the expected size of a bucket can be written as

E[N₁] = (N/V)(1 − (1 − 1/B)^V).

The expected size of all the cells in all the R tables is given by

R · B · E[N₁] = N R (B/V)(1 − (1 − 1/B)^V) = N R γ.

As each insertion into a Bloom Filter with false positive rate p costs 1.44 log₂(1/p) bits, E[M(RAMBO)] = 1.44 log₂(1/p) · N · R · γ. Let us call the extra multiplicative factor

γ = (B/V)(1 − (1 − 1/B)^V).

The bound γ ≤ 1 comes from the fact that 1 − (1 − 1/B)^V ≤ V/B (a union bound over the V balls), as the value taken by the variable X lies in {0, 1, …, V} with the probabilities P_j.

When B = 1, γ = 1/V, and as B grows, γ → 1.
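The balls-in-bins step above is easy to sanity-check empirically; the following Monte Carlo sketch (illustrative code, not from the paper) compares the closed form of γ against simulation:

import random

def gamma_empirical(B, V, trials=100_000, seed=0):
    # Throw V copies of a key into B bins and count the distinct bins hit;
    # dividing by V gives the per-table duplicate-reduction factor gamma.
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        hits += len({rng.randrange(B) for _ in range(V)})
    return hits / (trials * V)

B, V = 50, 10
closed_form = (B / V) * (1 - (1 - 1 / B) ** V)
print(closed_form, gamma_empirical(B, V))  # the two should closely agree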