1 Introduction
Approximate set membership is a fundamental problem that arises in many high-speed, memory-constrained applications in databases, networking, and search. The Bloom Filter [bloom1970space, mitzenmacher2002compressed, cohen2003spectral] is one of the most famous and widely adopted space-efficient data structures for approximate set membership. It allows constant-time, i.e., $O(1)$, membership testing in a mere $O(n)$ bits of space, where $n$ is the cardinality of the set
under consideration. Bloom Filters trade a small false-positive probability for an impressive query-time and memory trade-off. Compared to other sophisticated hashing algorithms
[networkapplications, simpleMPH], the simplicity of Bloom Filters and the ability to cheaply insert elements on the fly make them quite successful in many latency- and memory-constrained applications. In this work, we are interested in a related multiple set membership testing (MSMT) problem. Here, instead of one set $S$, we are given a collection of $K$ sets $\mathcal{S} = \{S_1, S_2, \ldots, S_K\}$, where each $S_i \subseteq U$ and $U$ is the universe of the keys. Given a query $q \in U$, our goal is to identify all the sets in $\mathcal{S}$ that contain $q$. In particular, we want to return the subset $\mathcal{A} \subseteq \{1, \ldots, K\}$ such that $i \in \mathcal{A}$ if and only if $q \in S_i$.
A recent Nature paper, "BItsliced Genomic Signature Index (BIGSI)" [bigSI], motivates the MSMT problem. Here, the authors indexed the entire global corpus of 447,833 bacterial and viral whole-genome sequence datasets using arrays of Bloom Filters. In this application, the authors treated every bacterial genome sequence as the set of all $k$-mers (contiguous length-$k$ character substrings) in the sequence. There are 447,833 such sets, which collectively occupy 170 terabytes of space. Given a query $k$-mer $q$, identifying all the genome sequences that contain $q$ is an important prerequisite for numerous computational biology pipelines. For instance, BIGSI applied this functionality to rapidly finding the resistance genes MCR-1, MCR-2, and MCR-3, determining the host range of 2,827 plasmids, and quantifying antibiotic resistance in archived datasets. This is precisely an MSMT problem. This work also stressed a few important properties of Bloom Filters that made MSMT at scale possible: the ability to perform streaming updates, the manipulation of simple bit arrays, and bitwise operations to obtain unions and intersections were critical for scaling the system. BIGSI represents the largest scale (170 TB) that has been achieved for MSMT in practice. The work mentions larger datasets and the need to scale BIGSI to handle millions of datasets, leading to hundreds of millions to billions of sets.
The authors use one Bloom Filter to compress each set (bacterial genome sequence) independently. Essentially, the MSMT instance is reduced to $K$ classical membership testing instances with an expensive query complexity of $O(K)$. Unfortunately, this $O(K)$ query complexity, when $K$ grows to hundreds of millions, is prohibitively expensive. We will call this approach the Array of Bloom Filters (ABF).
Motivated by the scale of BIGSI, our paper is concerned with streaming data structures only, where the data structure can be updated incrementally, analogous to Bloom Filters. In the streaming setting, we observe a set of inputs one at a time. Upon observing an element $x$, we must perform an online update to the data structure. We are not allowed to remember $x$ after the update is complete, and we are not allowed a second pass through the dataset. The streaming setting is also applicable to the type of web-scale problems that arise in industry-scale product recommendation.
Additionally, we would like to retain the benefits of manipulating bit arrays. Keeping in mind these constraints, the trivial solution of independent Bloom Filters seems to be the best. This is likely the reason for its use in practice and its recent application in BIGSI.
The main focus of this paper is to reduce the query time of MSMT while still maintaining the desirable properties associated with Bloom Filters. In particular, we are looking for a method with an efficient query-time algorithm that also has the following three properties: 1. Low false positive rate, 2. Cheap updates for streaming inputs, and 3. A simple, systems-friendly data structure that is straightforward to parallelize. Existing solutions with fast query-time algorithms are either space inefficient or require complex structures such as minimal perfect hash constructions that are neither simple to implement nor streamable. In this work, we propose RAMBO, a data structure with sublinear query time that has these properties.
1.1 Our Contribution
We show the first nontrivial data structure, RAMBO (Repeated And Merged Bloom Filter), for the MSMT problem. RAMBO achieves sublinear query time and at the same time satisfies the following three desired properties, analogous to the Array of Bloom Filters.

- The data structure consists only of bit arrays which can be updated incrementally on a stream, analogous to Bloom Filters.
- The query operation requires only hashing and bitwise AND/OR operations.
- The proposed architecture has zero false negatives.
Our proposed idea is a count-min sketch (CMS) [CMS, MACH] where each cell holds a Bloom Filter. On collisions, instead of addition, we perform the union of Bloom Filters. During inference, we take the set intersection instead of the minimum used by the CMS. The proposed data structure is an ideal illustration of the power of the count-min sketch beyond frequency estimation, demonstrating how we can compress nontrivial objects. The idea presented could be of independent interest in itself. In a nutshell, our main result is:
Theorem 1.
For $K$ sets in a multiple set membership problem, where each key is a member of at most a constant number of sets, our data structure solves the approximate multiple set membership testing problem using $O(\sqrt{K}\log K)$ expected query time
and using at most $O(\log K)$ extra bits per key compared to the Array of Bloom Filters.
2 Notations
Symbol  Notation
$K$  Number of sets
$N$  Total number of insertions in the standard Bloom Filter array
$x$  Key
$S_i$  The $i$-th set
$U$  Universe of the keys
$p$  False positive rate of a Bloom Filter
$B$  Size of a table in RAMBO
$R$  Number of tables in RAMBO
$V$  Max number of sets a key belongs to
$\delta$  Failure rate of RAMBO
3 Preliminaries
Here, we introduce preliminary concepts that will be needed in our formal analysis of RAMBO.
Universal Hashing
A 2-universal hash function family is defined as follows.
Definition 1.
A family of functions $H = \{h : U \to \{0, 1, \ldots, B-1\}\}$ is called a 2-universal hash family if, for all $x \neq y \in U$, $\Pr_{h \sim H}\left[h(x) = h(y)\right] \le \frac{1}{B}$.
The following fact is of particular importance for our analysis. If we construct a hash table $T$ with range size $B$ using a 2-universal hash function, then every bucket of $T$ has expected size $n/B$, where $n$ is the number of insertions into the hash table. Most practical hash functions, like murmurhash [horvath], are approximately 2-universal.
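As an illustration, the classic Carter-Wegman construction gives a 2-universal family. This is a sketch under simplifying assumptions: integer keys and a fixed Mersenne prime modulus; practical systems hash arbitrary byte strings with murmurhash instead.

```python
import random

def make_2universal(B, P=2**61 - 1):
    """Carter-Wegman family: h(x) = ((a*x + b) mod P) mod B.
    Drawing a, b at random gives Pr[h(x) == h(y)] ~ 1/B for x != y."""
    a = random.randrange(1, P)
    b = random.randrange(0, P)
    return lambda x: ((a * x + b) % P) % B

# Hashing n keys into B buckets yields an expected n/B keys per bucket.
h = make_2universal(B=10)
buckets = [0] * 10
for x in range(1000):
    buckets[h(x)] += 1
```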
Bloom Filters
The Bloom Filter is an array of $m$ bits which represents a set $S$ of $n$ elements. It is initialized with all bits set to 0. During construction, it uses $k$ hash functions $h_1, h_2, \ldots, h_k$, each with range $\{0, 1, \ldots, m-1\}$, to set the bits at the locations $h_1(x), h_2(x), \ldots, h_k(x)$ for every key $x \in S$. In this process, a bit in the Bloom Filter may be set to 1 multiple times, which does not change the state of the Bloom Filter. Once the construction is done, the Bloom Filter can be used to check the membership of a query $q$ by calculating the AND of the bits at the locations $h_1(q), h_2(q), \ldots, h_k(q)$. The output will be True if all the locations are 1, and False if at least one is 0. Given this structure, it is straightforward to prove that there are no false negatives, as every key $x \in S$ will set all the bits at locations $h_1(x), \ldots, h_k(x)$. However, false positives are introduced because the bits at the hash locations of a query $q \notin S$ may have been set by keys from the set $S$. The false positive rate of the Bloom Filter is given by:
$p = \left(1 - \left(1 - \frac{1}{m}\right)^{kn}\right)^k \approx \left(1 - e^{-kn/m}\right)^k.$
To minimize the false positive rate, the number of hash functions ($k$) and the size of the Bloom Filter ($m$) should be:
$k = \frac{m}{n}\ln 2, \qquad m = \frac{n\log_2(1/p)}{\ln 2},$
where $p$ is the false positive rate. The Bloom Filter grows in size linearly with the cardinality of the set it represents. It is a constant factor of $\log_2 e \approx 1.44$ away from the information-theoretic lower bound of $n\log_2(1/p)$ bits on size. Although this is suboptimal, the Bloom Filter enjoys widespread practical use due to its simplicity and systems-friendly properties.
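The construction above can be sketched in a few lines of Python. The salted use of the built-in `hash()` is a stand-in for $k$ independent hash functions (a real implementation would use murmurhash); the sizing follows the optimal formulas for $k$ and $m$.

```python
import math

class BloomFilter:
    def __init__(self, n, p):
        # optimal size m = n * log2(1/p) / ln 2 bits, k = (m/n) * ln 2 hashes
        self.m = math.ceil(n * math.log2(1 / p) / math.log(2))
        self.k = max(1, round(self.m / n * math.log(2)))
        self.bits = [0] * self.m

    def _locs(self, key):
        # salted hash() stands in for k independent hash functions h_1..h_k
        return [hash((i, key)) % self.m for i in range(self.k)]

    def add(self, key):
        for loc in self._locs(key):
            self.bits[loc] = 1  # re-setting a bit changes nothing

    def __contains__(self, key):
        # AND of the probed bits: no false negatives, rare false positives
        return all(self.bits[loc] for loc in self._locs(key))
```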
Count-Min Sketch
Our proposed architecture takes inspiration from the count-min sketch (CMS) [CMS], a probabilistic data structure first introduced to identify frequent elements, or heavy hitters, in a data stream. The CMS is a $B \times R$ array of integer counts that are incremented online in a randomized fashion. Each input to the CMS is hashed using $R$ universal hash functions to one of the $B$ rows in each of the $R$ columns of the sketch. To perform a streaming update, we find the $R$ hash values of an element $x$ and increment the counts at those locations (or cells). At query time, we want to determine the unknown frequency count of an element $q$. We find the $R$ cells in the array that $q$ maps to under the hash functions. This results in a set of counts. The estimated frequency count of $q$ is the minimum over this set of count values.
To understand how the CMS works, suppose we have a heavy-hitting element $x$ and consider the situation where $x$ has a hash collision with another element $y$ in one of the columns. The final count value for $x$ in this column is the sum of the counts of $x$ and $y$. If $x$ and $y$ are both heavy hitters, this is a bad overestimate. Otherwise, the count value is approximately correct. The key observation is that we are unlikely to report dramatic overestimates because we take the minimum over all columns. Intuitively, we are unlikely to have "problematic" hash collisions in every column if we assume that only a few elements are heavy hitters. This condition is common in the sketching literature [charikar2002finding, liu2016one, braverman2016beating] and is sometimes referred to as a power-law or sparsity assumption. Under this assumption, we have strong bounds on the error of the recovered estimates [CMS].
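The update and query just described can be captured in a minimal sketch; salted `hash()` again stands in for the universal hash functions.

```python
import random

class CountMinSketch:
    def __init__(self, R, B):
        self.R, self.B = R, B                      # R repetitions of B counters
        self.salts = [random.random() for _ in range(R)]
        self.counts = [[0] * B for _ in range(R)]

    def _cell(self, r, x):
        return hash((self.salts[r], x)) % self.B

    def add(self, x):
        # streaming update: increment one counter per repetition
        for r in range(self.R):
            self.counts[r][self._cell(r, x)] += 1

    def estimate(self, x):
        # collisions only inflate counts, so the minimum is the best estimate
        return min(self.counts[r][self._cell(r, x)] for r in range(self.R))
```

Note that the estimate is never an underestimate, which is the one-sided guarantee the analysis above relies on.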
4 Formal Problem Statement
We formally introduce the multiple set membership problem. Our problem is fundamentally the same as the one introduced by [mitzenmacher2016omass], but we have a slightly different notion of the failure probability. Our formulation of the problem is motivated by recent problems in web-scale search and DNA analysis for which our definition is more appropriate.
Definition 2.
Approximate Multiple Set Membership Testing (MSMT)
Given a collection of $K$ sets $\mathcal{S} = \{S_1, S_2, \ldots, S_K\}$ over a universe $U$, construct a data structure which, given any query $q \in U$, reports the correct membership status of $q$ for every set $S_i \in \mathcal{S}$ with probability at least $1 - \delta$.
Here, we define the failure probability $\delta$ as the probability of reporting an incorrect membership status for even one set from $\mathcal{S}$. This is in contrast to [mitzenmacher2016omass], which interprets the failure rate on a per-set basis. Otherwise, the two problems are equivalent. It should also be noted that in the context of Bloom Filters and RAMBO, the failure probability is equivalent to the false positive rate.
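For concreteness, a brute-force exact oracle for Definition 2, useful as ground truth when evaluating approximate structures, is simply the membership vector computed against every set:

```python
def msmt_exact(sets, q):
    """Exact membership vector of query q over the collection of sets.
    Costs one lookup per set; approximate structures aim to beat this."""
    return [int(q in s) for s in sets]
```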
5 Known Approaches
Array of Bloom Filters (ABF)
As argued before, one straightforward approach to the multiple set membership problem is to reduce it to $K$ instances of the classical membership testing problem. For this, we store an array of $K$ different set representations [bigSI], one for each set. Many data structures can be used to encode the representations, as there are a wide variety of approaches to the approximate set membership problem. For simplicity and concreteness, we consider arrays of Bloom Filters, since they are theoretically well-studied and are overwhelmingly common in practice.
The analysis for an Array of Bloom Filters (ABF) is straightforward. Let filter $\mathrm{BF}_i$ be associated with set $S_i$ and note that it requires $O(|S_i|\log(1/p))$ bits. Assuming that each filter has the same false positive rate $p$, the ABF requires $O(N\log(1/p))$ bits in total, where $N = \sum_i |S_i|$. The per-set false positive rate is $p$, and the query cost is $O(K)$, linear in the number of sets.
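The ABF baseline can be sketched as follows. The tiny Bloom Filter helper (salted `hash()`, set-bit indices kept in a Python set) is a simplifying stand-in for a real bit-array implementation; the point is the $O(K)$ query loop.

```python
import math

def make_bf(n, p=0.01):
    n = max(n, 1)
    m = math.ceil(n * math.log2(1 / p) / math.log(2))
    k = max(1, round(m / n * math.log(2)))
    return {"m": m, "k": k, "bits": set()}

def bf_add(bf, key):
    bf["bits"].update(hash((i, key)) % bf["m"] for i in range(bf["k"]))

def bf_query(bf, key):
    return all(hash((i, key)) % bf["m"] in bf["bits"] for i in range(bf["k"]))

def abf_build(sets):
    # one independent Bloom Filter per set
    filters = [make_bf(len(s)) for s in sets]
    for bf, s in zip(filters, sets):
        for key in s:
            bf_add(bf, key)
    return filters

def abf_query(filters, q):
    # the O(K) bottleneck: one membership test per set
    return {i for i, bf in enumerate(filters) if bf_query(bf, q)}
```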
Method | Query Time | Worst-case Memory | Comments
Array of Bloom Filters (ABF) | $O(K)$ | $O(N\log(1/p))$ bits | streaming updates, bitwise operations
Inverted Index (with minimal perfect hashing) | $O(1)$ | $O(N\log K)$ bits | not streamable, requires all keys upfront
RAMBO (this work) | $O(\sqrt{K}\log K)$ | $O(N\log(1/p)\log K)$ bits | streaming updates, zero false negatives
Inverted Index
The inverted index [witten2016data, baeza1999modern, witten1999managing] is an exact database query solution which stores a posting list for each key in the database. In our case, the posting list is the list of set IDs in which the given key is present. The total memory utilization of all the posting lists is $O(N\eta)$ bits, where $\eta$ is the bit precision of the set IDs. However, this does not guarantee constant query time by itself; the implementation of the inverted index determines the query time. To have $O(1)$ queries, the mapping from keys to posting lists should be implemented as a constant-time lookup key-value store. One way to implement this data structure is with a hash table with no collisions, essentially using a minimal perfect hashing scheme. Using the best-known method, this only adds an additional 2.07 bits per key. However, all existing minimal perfect hash functions require prior knowledge of all keys and therefore cannot be constructed in the streaming setting. In the context of MSMT, the size of the inverted index is $O(N\log K)$ bits, since for a large number of sets the bit precision is $\eta = \log_2 K$. This is far greater than a simple Bloom Filter array in practice.
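A plain posting-list index illustrates the approach. Here a Python dict stands in for the collision-free (minimal-perfect-hash) table, so this sketch is exact and gives $O(1)$ expected lookups, but it is not space-optimal; the space-optimal MPH variant described above would additionally need all keys known in advance.

```python
def build_inverted_index(sets):
    """Posting lists: key -> IDs of the sets containing it.
    A dict stands in for the minimal-perfect-hash table a space-optimal
    implementation would use."""
    index = {}
    for set_id, s in enumerate(sets):
        for key in s:
            index.setdefault(key, []).append(set_id)
    return index

def query_inverted_index(index, q):
    return index.get(q, [])  # O(1) expected lookup
```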
6 Intuition Behind our Approach
We draw inspiration from the popular count-min sketch (CMS) data structure for streaming data and from the sparse recovery literature. Assume that we reduce the $K$ sets into $B \ll K$ meta-sets by randomly grouping the sets and taking the union of each group. Now, given a query $q$, we only need to perform set membership testing on a small number ($B$) of meta-sets. If a merged set contains $q$, then we know that $q$ is a member of one of the sets we merged. In expectation, this reduces the number of candidates from $K$ to $K/B$ sets. If we independently repeat the process and request another set membership query over a different table of merged sets, then in expectation we have reduced the potential confusion space to $K/B^2$ using only two small set membership queries. We see that this goes down exponentially with more repetitions.
We propose a twist on the CMS. We replace the counters in the CMS with Bloom Filters. During sketch construction, instead of addition, we merge the sets that collide in a cell by taking their union. We refer to each merged Bloom Filter cell as a Bloom Filter Union (BFU), since each cell represents the merged union of sets.
To perform membership testing on this structure, we query each of the BFU cells. If an element is present in a particular set, the $R$ BFUs containing that set will all be active. If not, some of the BFUs may be active, but there is a strong likelihood that at least one will be inactive. To solve the MSMT problem, we only report sets for which every BFU was active. The intuition is that "unlucky" hash collisions and false positives occur relatively infrequently. The recovery operation is, therefore, an intersection over the active BFUs.
7 RAMBO: Repeated And Merged Bloom Filters
The proposed architecture is an array of $R$ tables, each containing $B$ cells. The tables in RAMBO are independent of each other, as we use an independent hash function for each of them. The $K$ sets are each hashed into one cell in each table (i.e., hashed $R$ times). Note that, due to the universality of the hash functions, every cell of a table in RAMBO contains an expected $K/B$ random group of sets from the collection $\mathcal{S}$.
Now each cell is a joint representation of a group of sets, rather than an individual set representation as in the Array of Bloom Filters architecture. To resolve the individual identity of a set, we use $R$ repetitions of the table, each with independent random groupings (details in Section 7.1). Just like in the CMS, the multiple tables help in singling out a common set (or sets) which contains the query. Section 7.2 explains the set resolution process in detail.
7.1 Creation and Insertion
RAMBO has a total of $R$ tables with $B$ cells each. We use 2-universal hash functions (Section 3), analogous to the CMS, for assigning sets to cells. Unlike the CMS, which combines elements colliding in the same cell by addition, we instead keep a Bloom Filter for the union of the sets colliding in the same cell. The size of this Bloom Filter of a union (BFU) depends on the number of unique keys in the cell.
The union is over the sets (an expected $K/B$ of them) that land in a cell. The cardinality of the union will be discussed in Section 8.1.
Our RAMBO data structure is merely a CMS (of size $B \times R$) of Bloom Filters. Clearly, this structure is conducive to updates with a data stream: every new key is hashed into the $R$ tables using its set ID. We can also avoid fixed-size Bloom Filters by replacing them with the scalable Bloom Filter [scalableBF] architecture.
7.2 Notion of Separability
Since each cell represents more than one set, we intuitively need multiple repetitions to distinguish between two given sets. We formalize this notion as set separability, that is, the ability of RAMBO to determine which set contains the query from the BFUs.
Definition 3.
Separability
We say that two sets $S_i$ and $S_j$, where $i \neq j$, are separable if, for any $q \in S_i \setminus S_j$, the RAMBO architecture returns the output $q \in S_i$ and $q \notin S_j$ with nonzero probability.
We will see in Section 8.2 that separability and the false positive rate depend on the BFU false positive rate $p$ and the RAMBO parameters $B$ and $R$.
7.3 MSMT Queries
A query $q$ can occur in multiple sets due to its multiplicity. The RAMBO architecture shown in Figure 2 returns at least one cell per table at query time. Due to the possibility of false positives, it can return multiple cells per table. This is a slight departure from the standard CMS. Our query process, therefore, is to first take the union of the sets in the returned cells within each table, followed by the intersection of those unions across tables. The union and intersection can be accomplished using fast bitwise operations. Algorithm 1 summarizes the flow. The extra query time incurred during this process is analyzed in Section 8.3. The resulting vector (as the problem definition suggests) is the final output membership vector of the given query.
8 Analysis
RAMBO is a $B \times R$ matrix of Bloom Filters, so it enjoys all the properties of the Array of Bloom Filters, i.e., streaming updates and bitwise operations. RAMBO has two important parameters, $B$ and $R$, which control the memory and the query time and also affect the false positive rate. In the remaining part of the paper, we show that there is an optimal choice of $R$ and $B$ for sublinear query time.
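The construction and query path described in Sections 7.1-7.3 can be sketched as follows. Several simplifying assumptions are made: the cell-to-sets map is stored explicitly for the recovery step (in principle it is determined by the hash function), bit arrays are modeled as Python sets of set-bit indices, and salted `hash()` replaces the 2-universal/murmur hashes.

```python
import random

class RAMBO:
    def __init__(self, R, B, m=256, k=4):
        self.R, self.B, self.m, self.k = R, B, m, k
        self.seeds = [random.random() for _ in range(R)]
        # one BFU (Bloom Filter of a union of sets) per cell
        self.bits = [[set() for _ in range(B)] for _ in range(R)]
        # which set IDs were merged into each cell (determined by the hash)
        self.cells = [[set() for _ in range(B)] for _ in range(R)]

    def _cell(self, r, set_id):
        return hash((self.seeds[r], set_id)) % self.B

    def insert(self, key, set_id):
        # streaming update: the key touches exactly R BFUs, then is forgotten
        for r in range(self.R):
            c = self._cell(r, set_id)
            self.cells[r][c].add(set_id)
            for i in range(self.k):
                self.bits[r][c].add(hash((i, key)) % self.m)

    def query(self, q):
        # per table: union of sets in active cells; across tables: intersection
        result = None
        for r in range(self.R):
            active = set()
            for c in range(self.B):
                if all(hash((i, q)) % self.m in self.bits[r][c]
                       for i in range(self.k)):
                    active |= self.cells[r][c]
            result = active if result is None else result & active
        return result
```

Because a true set's cell is active in every table, the intersection can never drop it, which mirrors the zero-false-negative property.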
8.1 Memory analysis
Lower Bound
At first glance, it may seem possible that RAMBO can represent the set membership information in $\mathcal{S}$ using less than $O(N)$ space. For instance, if the hash function merges two overlapping sets, then it seems that we save space by storing only one copy of their shared keys. Unfortunately, this intuition is wrong. Any data structure which solves the set-of-sets problem must solve an equivalent set membership problem that has an information-theoretic lower bound of $N\log_2(1/p)$ bits [carter1978exact].
Lemma 1.
Multiple set membership memory bound
Any data structure which solves the MSMT problem requires at least
$N\log_2(1/p)$
bits of space, where $p$ is the false positive rate of the membership testing from this data structure.
Proof.
Let $E$ be the set of all pairs $(x, S_i)$ for which $x \in S_i$ and $S_i \in \mathcal{S}$. Then $|E| = N$. Any data structure which solves the set-of-sets membership problem must identify whether a given pair is in $E$. Therefore, the data structure solves the approximate set membership problem over $E$ and requires $N\log_2(1/p)$ bits by the information-theoretic lower bound [carter1978exact]. ∎
Worst Case Memory Bound
To obtain a worst-case memory bound for RAMBO, we observe that the memory used by one repetition (one table) is upper bounded by the memory of the Array of Bloom Filters. This easily follows from the fact that each of the $N$ pairs from Lemma 1 is inserted into exactly one of the BFUs in each table. Using this fact, we arrive at the following worst-case memory bound.
Lemma 2.
Worst case memory bound
The maximum memory needed by a RAMBO data structure in the worst case is bounded by
$O\left(N\log(1/p)\log K\right),$
where the $\log K$ factor comes from the fact that the number of repetitions $R$ is of the order $O(\log K)$ (refer to Theorem 2) and $K$ is the number of sets in our problem. This result means that RAMBO will use at most $O(\log K)$ extra bits per element when compared to the ABF method.
Expected Memory
In practice, the actual memory usage of RAMBO lies between the best-case and worst-case memory bounds. Here, we provide an average-case analysis under a simplifying assumption to analyze the expected performance of our method. To make the analysis simpler, we assume that every key has a fixed number of duplicates $V$; that is, every item is present in exactly $V$ sets. We define the memory of RAMBO as the total size of all its BFUs.
Lemma 3.
For the proposed RAMBO architecture, with size $B \times R$, and data where every key has $V$ duplicates, the expected memory requirement is $\gamma R$ times the memory of the Array of Bloom Filters, where
$\gamma = \frac{B\left(1 - \left(1 - \frac{1}{B}\right)^{V}\right)}{V} \le 1.$
The expectation is defined over the multiplicity variable, which takes values in $\{1, 2, \ldots, V\}$; refer to Appendix A. The factor $\gamma$ acts as a multiple-count reduction factor: it equals 1 when $V = 1$ and decreases as $V$ grows.
We prove the expression for $\gamma$ and the expected memory, for any $B$ and $V$, in Appendix A.
8.2 False Positives
In this section, we express the total false positive rate of RAMBO in terms of $B$, $R$, $V$, and the false positive rate $p$ of each BFU. Our goal in this section is to prove that we achieve a given failure rate $\delta$ using only $R = O(\log(K/\delta))$ repetitions. We begin by finding the failure probability for one set and extend this result to all $K$ sets.
Our first claim is that RAMBO cannot report false negatives. This follows trivially from the hash-based merging procedure and the fact that each BFU cannot produce false negatives. If an element $q \in S_i$, then each BFU containing $S_i$ reports membership, and therefore we must report $q \in S_i$ according to Algorithm 1.
The per-set false positive analysis is more involved. Suppose we have an element $q \notin S_i$. For RAMBO to report that $q \in S_i$, we require the BFU containing $S_i$ to report membership in each column. There are two ways for this to occur. The first is for $S_i$ to randomly collide, under the hash mapping, with a set that does contain $q$. Alternatively, the BFU containing $S_i$ can have a false positive due to the underlying Bloom Filter structure. To illustrate, consider the example in Figure 4. Set $S_i$ maps to the BFUs marked by dots, and BFUs containing true positive sets are shaded. If the BFUs in columns 1, 3, 4, and 6 were all to have false positives, then we would incorrectly report that $q \in S_i$. By formalizing this argument, we obtain a lemma that relates the RAMBO parameters to the probability of incorrectly reporting whether $q \in S_i$.
Lemma 4.
Individual false positive rate
Given the RAMBO data structure with $B \times R$ BFUs having a false positive rate of $p$, and a query $q$, we assume that $q$ belongs to no more than $V$ sets. Under these assumptions, the probability of incorrectly reporting that $q \in S_i$ when $q \notin S_i$ is at most
$\left(\frac{V}{B} + p\right)^{R},$
where $p$ is the individual false positive rate of the BFUs.
The proof is a straightforward extension of the intuition presented earlier. False positives can be introduced by BFUs that violate the separability condition or by BFUs that happen to have a false positive. Using this lemma, we can construct the overall failure probability for the multiple set membership problem. Recall that we define failure as being incorrect about even one of the $K$ sets. By the probability union bound, we can set the individual false positive rate such that we have a bounded overall failure rate.
Lemma 5.
RAMBO failure rate
Given a RAMBO data structure with BFUs having a false positive rate of $p$ and a query $q$, we assume that $q$ belongs to no more than $V$ sets. Under this assumption, the probability ($\delta$) of reporting an incorrect membership status for any of the $K$ sets is upper bounded by
$\delta \le K\left(\frac{V}{B} + p\right)^{R},$
where $p$ is the individual false positive rate of the BFUs.
A direct consequence of Lemma 5 is that we need only $O(\log(K/\delta))$ RAMBO repetitions to obtain any overall failure probability $\delta$.
Theorem 2.
Given a failure rate $\delta$, we need
$R \ge \frac{\log(K/\delta)}{\log\left(\frac{1}{V/B + p}\right)} = O\left(\log\frac{K}{\delta}\right)$
repetitions.
8.3 Query time analysis
The objective of this section is to show that RAMBO achieves expected sublinear query time. To query the set membership status of an element $q$, we first need to compute $BR$ Bloom Filter lookups. After we have looked up $q$ in each BFU, we take the union over the active rows within each column and the intersection across the columns of the sets that hashed to the "active" (returned True) BFUs. We report all sets which survive the iterative union and intersection procedure, as these are the sets with an active path through RAMBO. To find the query time, we need to know the cost of performing these unions and intersections.
Since each column partitions the $K$ sets, the union operations do not require any computational overhead. The set intersections between columns, however, require $O(|A_j| + |A_{j+1}|)$ operations, where $A_j$ is the set of all active sets in the present column and $A_{j+1}$ is the set of all active sets in the next column. Since there are $R$ columns, the total cost of the intersections is $O\left(\sum_{j=1}^{R-1}\left(|A_j| + |A_{j+1}|\right)\right)$. By observing that $\mathbb{E}\left[|A_j|\right] = O\left(\frac{VK}{B} + Kp\right)$, we obtain the following lemma.
Lemma 6.
Expected query time
Given the RAMBO data structure with $B \times R$ BFUs and a query $q$ that is present in at most $V$ sets, the expected query time is
$O\left(R\left(B + \frac{VK}{B} + Kp\right)\right).$
Here, $K$ is the number of sets and $p$ is the false positive rate of each BFU.
By taking $B = \sqrt{K}$, $p = O\left(1/\sqrt{K}\right)$, and $R$ according to Theorem 2, we get an expression for the query time in terms of the overall failure probability $\delta$ and the total number of sets $K$. Our main theorem is a simplified version of this result where we omit the lower-order terms.
Theorem 3.
MSMT query time
Given a RAMBO data structure and a query that is present in at most $V$ sets, RAMBO solves the MSMT problem with probability $1 - \delta$ in expected query time
$O\left(\sqrt{K}\log\frac{K}{\delta}\right).$
Here, $K$ is the number of sets, $p$ is the false positive rate of each BFU, and $V$ is a constant independent of $K$ and $N$.
9 Conclusion
We propose a simple yet elegant data structure, RAMBO, for the Multiple Set Membership Testing problem. Many practical problems, like search in resource-constrained settings and sketching-based data structures for fast sampling, can be solved using RAMBO. We have proved the query time to be sublinear ($O(\sqrt{K}\log K)$) in the number of sets $K$. We have also seen that the memory requirement of RAMBO is at most $O(\log K)$ times the size of the Array of Bloom Filters in expectation. Given that keys are duplicated across sets (which occurs in practice), the additional memory factor is minimal. Moreover, the proposed data structure can be easily parallelised and supports insertions of streaming data keys. In the future, it will be interesting to analyse RAMBO in the case of a different multiplicity for every key.
References
Appendix A Appendix
The proof of Lemma 3 is as follows.
Consider the $K$ sets $S_1, \ldots, S_K$ with a total of $N = \sum_{i=1}^{K}|S_i|$ insertions. If we divide these $K$ sets into $B$ groups (by using a universal hashing method to hash sets into bins), we obtain unions of random groups of sets in which every key in a bin has a varying number of duplicates $v \in \{0, 1, \ldots, V\}$. Zero duplicates means the element does not exist in the bin, and one duplicate means the element has only one copy. This implies that
$v$ is a random variable with some distribution. We are going to find that distribution.
The size of a bucket, in terms of distinct keys, is determined by the number of insertions it receives and their multiplicities. If there are $N_1$ insertions in bucket 1 and $v$ is a random variable denoting the multiplicity of a key in that bucket, then by linearity of expectation, the expected number of distinct keys is the expected number of insertions scaled down by the multiplicity of the keys.
We can see this scaling as a multiple-count reduction factor, which works as a normalizer for the multiplicity of the keys.
After randomly dividing the sets into $B$ cells, we analyze the probability $P_v$ of a key having $v$ duplicates in a given bucket of a hash table of size $B$.
An element can have at most $V$ duplicates in the bucket, or it can have no presence in the bucket. This problem is similar to putting $V$ balls into $B$ bins. The probability of having all $V$ balls in one given bucket is $\left(\frac{1}{B}\right)^{V}$.
The probability of having $V - 1$ balls in one given bucket and the remaining one in any other bin is given by $\binom{V}{V-1}\left(\frac{1}{B}\right)^{V-1}\left(1 - \frac{1}{B}\right)$.
Hence the probability of getting $v$ balls in one given bucket and $V - v$ in the remaining others is given by:
$P_v = \binom{V}{v}\left(\frac{1}{B}\right)^{v}\left(1 - \frac{1}{B}\right)^{V - v}.$
From the expression of $P_v$, and since there are $N/V$ distinct keys in total, the expected size of a bucket (in distinct keys) can be written as
$\frac{N}{V}\left(1 - P_0\right) = \frac{N}{V}\left(1 - \left(1 - \frac{1}{B}\right)^{V}\right),$
because a key is present in the bucket unless all of its $V$ copies avoid it. The expected size of all the cells in all the tables is then given by
$R \cdot B \cdot \frac{N}{V}\left(1 - \left(1 - \frac{1}{B}\right)^{V}\right) = \gamma R N.$
Let us call the extra multiplicative factor
$\gamma = \frac{B\left(1 - \left(1 - \frac{1}{B}\right)^{V}\right)}{V}.$
The derived expression of $\gamma$ comes from the fact that $1 - \left(1 - \frac{1}{B}\right)^{V}$ is the probability that a key appears in a given bucket, while $\frac{V}{B}$ is the expected number of its copies that hash there. As $1 - \left(1 - \frac{1}{B}\right)^{V} \le \frac{V}{B}$, we can state that $\gamma \le 1$, with equality when $V = 1$, and $\gamma$ decreasing as $V$ grows. Since each key consumes a constant number of Bloom Filter bits, the expected memory of RAMBO is $\gamma R$ times that of the Array of Bloom Filters, which proves Lemma 3.