Bloom filter (BF) is a widely used data structure for low-memory and high-speed approximate membership testing (Bloom, 1970). Bloom filters compress a given set $S$ into bit arrays, where we can approximately test whether a given element (or query) $x$ belongs to the set $S$, i.e., whether $x \in S$ or not. Several applications, in particular caching in memory-constrained systems, have benefited tremendously from BF (Broder et al., 2002).
Bloom filter ensures a zero false negative rate (FNR), which is a critical requirement for many applications. However, BF has a non-zero false positive rate (FPR) (Dillinger and Manolios, 2004) due to hashing collisions, and this FPR measures the performance of BF. There is a known theoretical limit to how far it can be reduced. To achieve an FPR of $\epsilon$, BF costs at least $n \log_2(1/\epsilon) \log_2 e$ bits ($n = |S|$), which is a factor of $\log_2 e$ (about 44%) above the theoretical lower bound (Carter et al., 1978). Mitzenmacher (2002) proposed the Compressed Bloom filter to address the suboptimal space usage of BF, where the space usage can reach the theoretical lower bound in the optimal case.
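As a rough numerical illustration of this gap, the sketch below compares the space of an optimally tuned standard Bloom filter with the information-theoretic lower bound. It assumes the textbook expressions $n \log_2(1/\epsilon) \log_2 e$ and $n \log_2(1/\epsilon)$; the function names and the values of `n` and `eps` are illustrative choices, not taken from the paper.

```python
import math


def bloom_filter_bits(n: int, eps: float) -> float:
    """Bits used by an optimally tuned standard Bloom filter (textbook formula)."""
    return n * math.log2(1.0 / eps) * math.log2(math.e)


def lower_bound_bits(n: int, eps: float) -> float:
    """Information-theoretic lower bound for approximate membership."""
    return n * math.log2(1.0 / eps)


# Illustrative values only (assumptions, not from the experiments).
n, eps = 100_000, 0.01
bf_bits = bloom_filter_bits(n, eps)
lb_bits = lower_bound_bits(n, eps)
overhead = bf_bits / lb_bits - 1.0

print(f"Bloom filter: {bf_bits:,.0f} bits, lower bound: {lb_bits:,.0f} bits")
print(f"relative overhead: {overhead:.1%}")
```

The overhead is independent of `n` and `eps`, since both cancel in the ratio.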
To achieve a more significant reduction of FPR, researchers have generalized BF and incorporated information beyond the query itself to break through the theoretical lower bound of space usage. Bruck et al. (2006) made use of the query frequency and varied the number of hash functions based on it to reduce the overall FPR. Recent work (Kraska et al., 2018; Mitzenmacher, 2018) has proposed to improve the performance of the standard Bloom filter by incorporating a machine learning model. This approach offers new hope of reducing false positive rates beyond the theoretical limit, by using context-specific information in the form of a machine learning model (Hsu et al., 2019). Rae et al. (2019) further proposed the Neural Bloom Filter, which learns to write to memory using a distributed write scheme and achieves compression gains over the classical Bloom filter.
The key idea behind Kraska et al. (2018) is to use the machine learning model as a pre-filter that gives each query $x$ a score $s(x)$. The score $s(x)$ is usually positively associated with the odds that $x \in S$. The assumption is that in many practical settings, the membership of a query in the set $S$ can be inferred from observable features of $x$, and such information is captured by the classifier-assigned score $s(x)$. The proposal of Kraska et al. uses this score and treats a query $x$ with a score $s(x)$ higher than a pre-determined threshold $\tau$ (high-confidence predictions) as a direct indicator of membership. Queries with scores less than $\tau$ are passed to the back-up Bloom filter.
Compared to the standard Bloom filter, the learned Bloom filter (LBF) uses a machine learning model to answer keys with a high score $s(x)$. Thus, the classifier reduces the number of keys hashed into the Bloom filter. When the machine learning model has reliable prediction performance, the learned Bloom filter significantly reduces the FPR and saves memory (Kraska et al., 2018). Mitzenmacher (2018) further provided a formal mathematical model for estimating the performance of LBF. In the same paper, the author proposed a generalization named sandwiched learned Bloom filter (sandwiched LBF), where an initial filter is added before the learned oracle to improve the FPR if the parameters are chosen optimally.
Wastage of Information:
For existing learned Bloom filters to have a lower FPR, predictions with a classifier score greater than the threshold $\tau$ should have a small probability of being wrong. Also, a significant fraction of the keys should fall in this high-score regime to ensure that the backup filter is small. However, when the score $s(x)$ is less than $\tau$, the information in the score is never used. Thus, there is a clear waste of information. For instance, consider two elements $x_1$ and $x_2$ with $\tau > s(x_1) \gg s(x_2)$. In the existing solutions, $x_1$ and $x_2$ are treated in exactly the same way, even though there is enough prior evidence to believe that $x_1$ is more likely to be positive than $x_2$.
Strong dependency on Generalization:
It is natural to assume that a prediction with high confidence implies a low FPR when the data distribution does not change. However, this assumption is too strong for many practical settings. First and foremost, the data distribution is likely to change in an online streaming environment where Bloom filters are deployed. Data streams are known to be bursty, with drift in distribution (Kleinberg, 2003). As a result, the confidence of the classifier, and hence the threshold, is not completely reliable. Secondly, the susceptibility of machine learning oracles to adversarial examples brings a new vulnerability into the system. Examples can easily be created that the classifier misclassifies at any given confidence level $\tau$. Bloom filters are commonly used in networks, where such an increased adversarial false positive rate can hurt performance. Increased latency due to collisions can open new possibilities for Denial-of-Service (DoS) attacks (Feinstein et al., 2003).
For a binary classifier, the density of the score distribution, $f(s(x))$, shows a different trend for elements in the set and outside the set $S$. We observe that for keys, $f(s(x) \mid x \in S)$ shows an ascending trend as $s(x)$ increases, while $f(s(x) \mid x \notin S)$ has the opposite trend. To reduce the overall FPR, we need lower FPRs for groups with a high $f(s(x) \mid x \notin S)$. Hence, if we tune the number of hash functions differently, more hash functions are required for the corresponding groups, while for groups with few non-keys we allow higher FPRs. This variability is the core idea behind obtaining a better trade-off.
Instead of relying only on whether the classifier score is above a single specific threshold, we propose two algorithms, Ada-BF and disjoint Ada-BF, that rely on the complete spectrum of scores by adaptively tuning Bloom filter parameters in different score regions. 1) Ada-BF tunes the number of hash functions differently in different regions to adjust the FPR adaptively; disjoint Ada-BF allocates Bloom filters of variable memory to each region. 2) Our theoretical analysis reveals a new set of trade-offs that brings a lower FPR with our proposed scheme compared to existing alternatives. 3) We evaluate the performance of our algorithms on two datasets, malicious URLs and malware MD5 signatures, where our methods reduce the FPR by over 80% and save 50% of the memory usage over existing learned Bloom filters.
We first define the notation used in this paper. Let $[g]$ denote the index set $\{1, 2, \dots, g\}$. We define a query $x$ as a key if $x \in S$, or a non-key if $x \notin S$. Let $n$ denote the number of keys ($n = |S|$), and $m$ denote the number of non-keys. We denote by $K$ the number of hash functions used in the Bloom filter.
2 Review: Bloom Filter and Learned Bloom Filter
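A standard Bloom filter for compressing a set $S$ consists of an $R$-bit array and $K$ independent random hash functions $h_1, h_2, \dots, h_K$, taking integer values between $0$ and $R-1$, i.e., $h_i: S \to \{0, 1, \dots, R-1\}$. The bit array is initialized with all $0$s. For every item $x \in S$, the bit at position $h_i(x)$ is set to $1$ for all $i \in \{1, 2, \dots, K\}$.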
To check the membership of an item $x$ in the set $S$, we return true if all the bits at positions $h_i(x)$, for all $i \in \{1, 2, \dots, K\}$, have been set to 1. It is clear that the Bloom filter has zero FNR (false negative rate). However, due to lossy hash functions, $x$ may be wrongly identified as positive when $x \notin S$ while all the bits at positions $h_i(x)$ are set to 1 due to random collisions. It can be shown that if the hash functions are independent, the expected FPR can be written as follows

$$\mathbb{E}(\mathrm{FPR}) = \left(1 - \left(1 - \frac{1}{R}\right)^{Kn}\right)^{K}.$$
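The construction and query procedure above can be sketched in a few lines. This is a minimal illustration, not the implementation used in the experiments: it assumes salted SHA-256 digests as a stand-in for independent random hash functions, stores one byte per bit for readability, and uses the standard expected-FPR expression $(1 - (1 - 1/R)^{Kn})^K$. The class and function names are hypothetical.

```python
import hashlib


class BloomFilter:
    """Standard Bloom filter with R bits and K hash functions."""

    def __init__(self, num_bits: int, num_hashes: int):
        self.R = num_bits
        self.K = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity only

    def _positions(self, item: str):
        # Salting the digest with i emulates K different hash functions
        # (an assumption of this sketch, not a true independent family).
        for i in range(self.K):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.R

    def add(self, item: str) -> None:
        for p in self._positions(item):
            self.bits[p] = 1

    def __contains__(self, item: str) -> bool:
        return all(self.bits[p] for p in self._positions(item))


def expected_fpr(R: int, K: int, n: int) -> float:
    """Expected FPR under the independent-hash assumption."""
    return (1.0 - (1.0 - 1.0 / R) ** (K * n)) ** K


keys = [f"key-{i}" for i in range(1000)]
bf = BloomFilter(num_bits=10_000, num_hashes=7)
for k in keys:
    bf.add(k)

non_keys = [f"other-{i}" for i in range(5000)]
empirical_fpr = sum(x in bf for x in non_keys) / len(non_keys)
print("expected FPR :", expected_fpr(10_000, 7, len(keys)))
print("empirical FPR:", empirical_fpr)
```

Every inserted key is reported as present, while the measured false positive rate on unseen items should be close to the expected value.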
Learned Bloom filter:
The learned Bloom filter adds a binary classification model to reduce the effective number of keys going to the Bloom filter. The classifier is pre-trained on some available training data to classify whether any given query $x$ belongs to $S$ or not based on its observable features. LBF sets a threshold $\tau$, where $x$ is identified as a key if $s(x) \geq \tau$. Otherwise, $x$ is checked against a backup Bloom filter to determine its membership in a further step (Figure 1). Like the standard Bloom filter, LBF also has zero FNR. False positives can be caused either by false positives of the classification model ($s(x) \geq \tau$ for $x \notin S$) or by those of the Bloom filter.
It is clear that when the region $s(x) \geq \tau$ contains a large number of keys, the number of keys inserted into the Bloom filter decreases, which leads to a favorable FPR. However, since we identify everything in the region $s(x) \geq \tau$ as positive, higher values of $\tau$ are better for the classifier's false positives. At the same time, a large $\tau$ decreases the number of keys in the region $s(x) \geq \tau$, increasing the load of the Bloom filter. Thus, there is a clear trade-off.
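A minimal sketch of this two-stage structure is given below. It assumes a user-supplied scoring function in place of a trained classifier; `toy_score`, the example keys, and the salted SHA-256 hashing are hypothetical stand-ins chosen only to make the sketch self-contained.

```python
import hashlib
from typing import Callable, Iterable


class BloomFilter:
    """Compact backup Bloom filter (salted SHA-256 as stand-in hashes)."""

    def __init__(self, num_bits: int, num_hashes: int):
        self.R, self.K = num_bits, num_hashes
        self.bits = bytearray(num_bits)

    def _positions(self, item: str):
        for i in range(self.K):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.R

    def add(self, item: str) -> None:
        for p in self._positions(item):
            self.bits[p] = 1

    def __contains__(self, item: str) -> bool:
        return all(self.bits[p] for p in self._positions(item))


class LearnedBloomFilter:
    """Classifier pre-filter with threshold tau plus a backup Bloom filter."""

    def __init__(self, score: Callable[[str], float], tau: float,
                 keys: Iterable[str], num_bits: int, num_hashes: int):
        self.score, self.tau = score, tau
        self.backup = BloomFilter(num_bits, num_hashes)
        self.backup_key_count = 0
        for key in keys:
            # Only keys the classifier would miss go into the backup filter;
            # this is what keeps the false negative rate at zero.
            if score(key) < tau:
                self.backup.add(key)
                self.backup_key_count += 1

    def __contains__(self, x: str) -> bool:
        return self.score(x) >= self.tau or x in self.backup


def toy_score(url: str) -> float:
    # Hypothetical scoring rule standing in for a trained model.
    return 0.9 if "evil" in url else 0.2


keys = [f"evil-{i}.example" for i in range(50)] + \
       [f"shady-{i}.example" for i in range(50)]
lbf = LearnedBloomFilter(toy_score, tau=0.5, keys=keys,
                         num_bits=2000, num_hashes=4)
print("keys sent to backup filter:", lbf.backup_key_count)
print("all keys found:", all(k in lbf for k in keys))
```

Raising `tau` in this sketch sends more keys to the backup filter, while lowering it lets more high-scoring non-keys through unchecked, which is the trade-off described above.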
3 A Strict Generalization: Adaptive Learned Bloom Filter (Ada-BF)
Under the formulation of LBF in the previous section, LBF actually divides the queries $x$ into two groups. When $s(x) \geq \tau$, $x$ is identified as a key directly without testing with the Bloom filter. In other words, LBF uses zero hash functions to identify its membership. Otherwise, we test its membership using $K$ hash functions. Viewed differently, LBF switches from $K$ hash functions to no hash function at all, depending on whether $s(x) \geq \tau$ or not. Continuing with this mindset, we propose the adaptive learned Bloom filter, where $x$ is assigned to one of $g$ groups based on $s(x)$, and for group $j$ we use $K_j$ hash functions to test its membership. The structure of Ada-BF is represented in Figure 1(b).
More specifically, we divide the spectrum into $g$ regions, where $x \in$ Group $j$ if $s(x) \in [\tau_{j-1}, \tau_j)$, $j = 1, 2, \dots, g$. Without loss of generality, we assume $0 = \tau_0 < \tau_1 < \dots < \tau_{g-1} < \tau_g = 1$. Keys from group $j$ are inserted into the Bloom filter using $K_j$ independent hash functions. Thus, we use a different number of universal hash functions for keys from different groups.
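For a group $j$, the expected FPR can be expressed as

$$\mathbb{E}(\mathrm{FPR}_j) = \left(1 - \left(1 - \frac{1}{R}\right)^{\sum_{t=1}^{g} n_t K_t}\right)^{K_j} = \alpha^{K_j},$$

where $n_t$ is the number of keys falling in group $t$, and $K_j$ is the number of hash functions used in group $j$. By varying $K_j$, $\mathbb{E}(\mathrm{FPR}_j)$ can be controlled differently for each group.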
A variable number of hash functions gives us enough flexibility to tune the FPR of each region. To avoid overloading the bit array, we only increase $K_j$ for groups with a small number of keys $n_j$, and decrease $K_j$ for groups with a large $n_j$. It should be noted that $f(s(x) \mid x \in S)$ shows an opposite trend compared to $f(s(x) \mid x \notin S)$ as $s(x)$ increases (Figure 2). Thus, there is a need for variable tuning, and a spectrum of regions gives us the room to exploit this variability efficiently. Clearly, Ada-BF generalizes the LBF. When Ada-BF divides the queries into only two groups, by setting $K_1 = K$, $K_2 = 0$ and $\tau_1 = \tau$, Ada-BF reduces to the LBF.
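The following sketch illustrates the idea of a single shared bit array with a group-dependent number of hash functions. It is an illustration under stated assumptions rather than the authors' implementation: scores come from a hypothetical lookup table `toy_scores`, hash functions are emulated with salted SHA-256, and the thresholds and hash counts are arbitrary example values.

```python
import bisect
import hashlib
from typing import Callable, Iterable, Sequence


class AdaBF:
    """One bit array; group j (by score) uses hashes_per_group[j] hashes."""

    def __init__(self, score: Callable[[str], float],
                 thresholds: Sequence[float],
                 hashes_per_group: Sequence[int],
                 num_bits: int, keys: Iterable[str]):
        # thresholds holds the interior cut points tau_1 < ... < tau_{g-1}.
        if len(hashes_per_group) != len(thresholds) + 1:
            raise ValueError("need one hash count per score group")
        self.score = score
        self.thresholds = list(thresholds)
        self.hashes = list(hashes_per_group)
        self.R = num_bits
        self.bits = bytearray(num_bits)
        for key in keys:
            for p in self._positions(key):
                self.bits[p] = 1

    def _num_hashes(self, x: str) -> int:
        # bisect_right places a score equal to tau_j into the upper group,
        # matching the half-open intervals [tau_{j-1}, tau_j).
        group = bisect.bisect_right(self.thresholds, self.score(x))
        return self.hashes[group]

    def _positions(self, x: str):
        for i in range(self._num_hashes(x)):
            digest = hashlib.sha256(f"{i}:{x}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.R

    def __contains__(self, x: str) -> bool:
        # With zero hash functions, all() over an empty sequence is True,
        # so the highest-score group is accepted directly.
        return all(self.bits[p] for p in self._positions(x))


# Hypothetical scores: most keys score high, a few score low.
toy_scores = {f"k{i}": 0.9 for i in range(30)}
toy_scores.update({f"k{i}": 0.5 for i in range(30, 50)})
toy_scores.update({f"k{i}": 0.1 for i in range(50, 60)})


def score(x: str) -> float:
    return toy_scores.get(x, 0.1)  # unseen items get a low score


keys = list(toy_scores)
ada = AdaBF(score, thresholds=[0.3, 0.7], hashes_per_group=[4, 2, 0],
            num_bits=4096, keys=keys)
print("all keys found:", all(k in ada for k in keys))
```

With `thresholds=[tau]` and `hashes_per_group=[K, 0]`, the same class behaves like the LBF described earlier.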
3.1 Simplifying the Hyper-Parameters
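To implement Ada-BF, there are some hyper-parameters to be determined, including the number of hash functions for each group, $K_j$, and the score thresholds that divide the groups, $\tau_j$ ($\tau_0 = 0$, $\tau_g = 1$). Altogether, we need to tune $2g - 1$ hyper-parameters. Using these hyper-parameters, the expected overall FPR of Ada-BF can be expressed as

$$\mathbb{E}(\mathrm{FPR}) = \sum_{j=1}^{g} p_j \, \mathbb{E}(\mathrm{FPR}_j) = \sum_{j=1}^{g} p_j \, \alpha^{K_j},$$

where $p_j = \Pr(\tau_{j-1} \leq s(x) < \tau_j \mid x \notin S)$. Empirically, $p_j$ can be estimated by $\hat{p}_j = m_j / m$ ($m$ is the number of non-keys in the training data and $m_j$ is the number of non-keys belonging to group $j$). It is almost impossible to find the optimal hyper-parameters that minimize $\mathbb{E}(\mathrm{FPR})$ in reasonable time. However, since the estimated number of false positive items is $\sum_{j=1}^{g} m_j \alpha^{K_j} = O(\max_j (m_j \alpha^{K_j}))$, we prefer $m_j \alpha^{K_j}$ to be similar across groups when $\mathbb{E}(\mathrm{FPR})$ is minimized. Because $\alpha^{K_j}$ decreases exponentially fast with larger $K_j$, keeping $m_j \alpha^{K_j}$ stable across groups requires $m_j$ to grow exponentially fast with $K_j$. Moreover, since $f(s(x) \mid x \notin S)$ increases as $s(x)$ becomes smaller in most cases, $K_j$ should also be larger for smaller $s(x)$. Hence, to balance the number of false positive items, as $j$ diminishes we should increase $K_j$ linearly and let $m_j$ grow exponentially fast.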
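With this idea, we provide a strategy to simplify the tuning procedure. We fix $m_j / m_{j+1} = c$ and $K_j - K_{j+1} = 1$ for $j = 1, 2, \dots, g-1$. Since the true density of $s(x \mid x \notin S)$ is unknown, to implement the strategy we estimate $p_j / p_{j+1}$ by $\hat{p}_j / \hat{p}_{j+1} = m_j / m_{j+1}$ and fix this ratio to $c$. This strategy ensures that $\hat{p}_j$ grows exponentially fast with $K_j$. Now we only have three hyper-parameters, $c$, $K_{\min}$ and $K_{\max}$ ($K_{\max} = K_1$). By default, we may also set $K_{\min} = K_g = 0$, which is equivalent to identifying all the items in group $g$ as keys.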
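One way to realize this simplified strategy is sketched below: given the scores of non-keys in the training data, pick thresholds so that each group holds roughly $c$ times as many non-keys as the next one, and let the hash counts decrease by one per group. The function `tune_thresholds` and its quantile-based construction are assumptions made for illustration; ties among scores are not handled.

```python
from typing import List, Sequence, Tuple


def tune_thresholds(nonkey_scores: Sequence[float], c: float,
                    k_min: int, k_max: int) -> Tuple[List[float], List[int]]:
    """Return interior thresholds tau_1..tau_{g-1} and hash counts K_1..K_g."""
    g = k_max - k_min + 1                      # K_j - K_{j+1} = 1 fixes g
    # Target group sizes m_j proportional to c^(g-j), so m_j / m_{j+1} = c.
    weights = [c ** (g - j) for j in range(1, g + 1)]
    total = sum(weights)

    ordered = sorted(nonkey_scores)
    m = len(ordered)
    thresholds, cumulative = [], 0.0
    for w in weights[:-1]:
        cumulative += w / total
        idx = min(int(round(cumulative * m)), m - 1)
        thresholds.append(ordered[idx])        # empirical quantile of scores

    hashes = list(range(k_max, k_min - 1, -1))  # K_1 = k_max, ..., K_g = k_min
    return thresholds, hashes


# Synthetic, evenly spread non-key scores (illustrative data only).
scores = [i / 1000 for i in range(1000)]
thresholds, hashes = tune_thresholds(scores, c=2.0, k_min=0, k_max=2)

edges = [float("-inf")] + thresholds + [float("inf")]
counts = [sum(lo <= s < hi for s in scores)
          for lo, hi in zip(edges[:-1], edges[1:])]
print("thresholds:", thresholds)
print("hash counts:", hashes)
print("non-keys per group:", counts)
```

On real scores the achieved ratios will only approximate $c$, since the empirical distribution is discrete.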
Lemma 1. Assume 1) the scores of non-keys, $s(x) \mid x \notin S$, independently follow a distribution $f$; 2) the scores of non-keys in the training set are independently sampled from the same distribution $f$. Then, the overall estimation error of $\hat{p}_j$, $\sum_j |\hat{p}_j - p_j|$, converges to 0 in probability as $m$ becomes larger. Moreover, if , with probability at least , we have .
In a real application, we cannot access the exact value of $p_j$, which may lead to an estimation error in the real $\mathbb{E}(\mathrm{FPR})$. However, Lemma 1 shows that as long as we can collect enough non-keys to estimate $p_j$, the estimation error is almost negligible. For large-scale membership testing tasks in particular, collecting enough non-keys is easy.
3.2 Analysis of Adaptive Learned Bloom Filter
Compared with the LBF, Ada-BF makes full use of the density distribution of $s(x)$ and optimizes the FPR in different regions. Next, we show that Ada-BF can reduce the optimal FPR of the LBF without increasing the memory usage.
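When $p_j / p_{j+1} = c$ and $K_j - K_{j+1} = 1$ for $j = 1, 2, \dots, g-1$, the expected FPR follows

$$\mathbb{E}(\mathrm{FPR}) = \sum_{j=1}^{g} p_j \alpha^{K_j} = \begin{cases} \dfrac{(c-1)\left(c^{g}\alpha^{K_{\max}+1} - \alpha^{K_{\min}}\right)}{(c^{g}-1)(c\alpha - 1)}, & c\alpha \neq 1, \\[2ex] \dfrac{g\,(c-1)\,\alpha^{K_{\min}}}{c^{g}-1}, & c\alpha = 1, \end{cases}$$

where $K_{\max} = K_1$ and $K_{\min} = K_g = K_{\max} - g + 1$. To simplify the analysis, we assume $c\alpha > 1$ in the following theorem. Given that the number of groups $g$ is fixed, this assumption can be satisfied without loss of generality by raising $c$, since $\alpha$ increases as $c$ becomes larger. For comparison, we also need $\tau$ of the LBF to be equal to $\tau_{g-1}$ of the Ada-BF. In this case, queries with scores higher than $\tau$ are identified as keys directly by the machine learning model. So, to compare the overall FPR, we only need to compare the FPR of queries with scores lower than $\tau$.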
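Theorem 1. For Ada-BF, given $p_j / p_{j+1} \geq c > 1$ for all $j \in [g-1]$, suppose there exists $\lambda > 0$ such that $c\alpha \geq 1 + \lambda$ holds, and $n_{j+1} - n_j > 0$ for all $j \in [g-1]$ ($n_j$ is the number of keys in group $j$). When $g$ is large enough and $g \leq \lfloor 2K \rfloor$, Ada-BF has a smaller FPR than the LBF. Here $K$ is the number of hash functions of the LBF.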
Theorem 1 requires that the number of keys $n_j$ keeps increasing while $p_j$ decreases exponentially fast with $j$. As shown in Figure 2, on real datasets we observe from the histogram that as the score increases, $f(s(x) \mid x \notin S)$ decreases very fast while $f(s(x) \mid x \in S)$ increases. So, the assumptions of Theorem 1 are approximately satisfied.
Moreover, when the number of buckets is large enough, the optimal $K$ of the LBF is large as well. Given that the assumptions hold, Theorem 1 implies that we can choose a larger $g$ to divide the spectrum into more groups and get a better FPR. The LBF is sub-optimal as it only has two regions. Our experiments clearly show this trend. For Figure 3(a), Ada-BF achieves 25% of the FPR of the LBF when the bitmap size is 200Kb, while when the budget of buckets is 500Kb, Ada-BF achieves 15% of the FPR of the LBF. For Figure 3(b), Ada-BF only reduces the FPR of the LBF by 50% when the budget of buckets is 100Kb, while when the budget is 300Kb, Ada-BF reduces the FPR of the LBF by 70%. Therefore, both the analytical and experimental results indicate superior performance of Ada-BF when the spectrum is divided into more, smaller groups. Conversely, when $g$ is small, Ada-BF is more similar to the LBF, and their performances are harder to tell apart.
4 Disjoint Adaptive Learned Bloom Filter (Disjoint Ada-BF)
Ada-BF divides keys into $g$ groups based on their scores and hashes the keys into the same Bloom filter using different numbers of hash functions. Following a similar idea, we propose an alternative approach, disjoint Ada-BF, which also divides the keys into $g$ groups but hashes keys from different groups into independent Bloom filters. The structure of disjoint Ada-BF is represented in Figure 1(c). Assume we have a total budget of $R$ bits for the Bloom filters and the keys are divided into $g$ groups in the same way as in Ada-BF. Consequently, the keys from group $j$ are inserted into the $j$-th Bloom filter, whose length is $R_j$ ($R = \sum_{j=1}^{g} R_j$). Then, during the lookup stage, we just need to identify a query's group and check its membership in the corresponding Bloom filter.
4.1 Simplifying the Hyper-Parameters
Analogous to Ada-BF, disjoint Ada-BF also has many hyper-parameters, including the score thresholds for dividing the groups and the length of each Bloom filter. To determine the thresholds $\tau_j$, we use the tuning strategy discussed in the previous section, tuning the number of groups $g$ and the ratio $m_j / m_{j+1} = c$. To find the $R_j$ that optimize the overall FPR, we again rely on the idea from the previous section that the expected number of false positives should be similar across groups. For a Bloom filter with $R_j$ buckets, the optimal number of hash functions $K_j$ can be approximated as $K_j = \frac{R_j}{n_j} \log 2$, where $n_j$ is the number of keys in group $j$. The corresponding optimal expected FPR is $\mathbb{E}(\mathrm{FPR}_j) = \mu^{R_j / n_j}$ ($\mu = 2^{-\log 2} \approx 0.618$). Therefore, to make the expected number of false positive items similar across groups, $R_j$ needs to satisfy

$$m_j \, \mu^{R_j / n_j} = m_1 \, \mu^{R_1 / n_1} \iff \frac{R_j}{n_j} - \frac{R_1}{n_1} = \frac{(j-1)\log c}{\log \mu}.$$
Since $n_j$ is known once the thresholds $\tau_j$ are fixed, and the total budget of buckets $R$ is known, $R_j$ can be solved for accordingly. Moreover, when the machine learning model is accurate, to save memory we may also set $R_g = 0$, which means the items in group $g$ are identified as keys directly.
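The allocation can be solved in closed form if one assumes the balancing condition is that $m_j \mu^{R_j/n_j}$ takes a common value $T$ across groups, with $\mu = 2^{-\log 2}$ and $\sum_j R_j = R$. The sketch below follows that assumption; `allocate_bits` and the example counts are hypothetical, the result is real-valued (rounding is left out), and groups that would receive a negative allocation are not handled.

```python
import math
from typing import List, Sequence

MU = 2.0 ** (-math.log(2.0))  # optimal per-filter FPR base, about 0.618


def allocate_bits(n: Sequence[int], m: Sequence[int],
                  total_bits: float) -> List[float]:
    """Split total_bits so that m_j * MU**(R_j / n_j) is equal for all groups.

    n[j]: keys in group j, m[j]: training non-keys in group j (all > 0).
    """
    log_mu = math.log(MU)
    # From m_j * MU**(R_j/n_j) = T:  R_j = n_j * (log T - log m_j) / log MU.
    # Summing over j and using sum(R_j) = total_bits gives log T.
    log_t = (total_bits * log_mu
             + sum(nj * math.log(mj) for nj, mj in zip(n, m))) / sum(n)
    return [nj * (log_t - math.log(mj)) / log_mu for nj, mj in zip(n, m)]


# Illustrative group sizes (assumptions, not from the experiments).
n_keys = [1000, 500]
m_nonkeys = [8000, 1000]
bits = allocate_bits(n_keys, m_nonkeys, total_bits=20_000)
expected_fp = [mj * MU ** (rj / nj)
               for nj, mj, rj in zip(n_keys, m_nonkeys, bits)]
print("bits per group:", [round(b) for b in bits])
print("expected false positives per group:", expected_fp)
```

In this example the group with more non-keys receives more bits per key, in line with the idea of lowering the FPR where non-keys dominate.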
4.2 Analysis of Disjoint Adaptive Learned Bloom Filter
Disjoint Ada-BF uses a group of shorter Bloom filters to store the hash outputs of the keys. Though the approach to controlling the FPR of each group differs from Ada-BF (Ada-BF varies $K$, while disjoint Ada-BF changes the bucket allocation), both methods share the same core idea: lower the overall FPR by reducing the FPR of the groups dominated by non-keys. Disjoint Ada-BF allocates more buckets to these groups to achieve a smaller FPR. In the following theorem, we show that to achieve the same optimal expected FPR as the LBF, disjoint Ada-BF consumes fewer buckets. Again, for comparison we need $\tau$ of the LBF to be equal to $\tau_{g-1}$ of the disjoint Ada-BF.
Theorem 2. If $p_1 / p_j = c^{j-1} > 1$ and $n_{j+1} - n_j > 0$ for all $j \in [g-1]$ ($n_j$ is the number of keys in group $j$), then to achieve the optimal FPR of the LBF, disjoint Ada-BF consumes fewer buckets than the LBF when $g$ is large.
We test the performance of five filters: 1) standard Bloom filter, 2) learned Bloom filter, 3) sandwiched learned Bloom filter, 4) adaptive learned Bloom filter, and 5) disjoint adaptive learned Bloom filter. We use two datasets with different associated tasks, namely: 1) Malicious URLs Detection and 2) Virus Scan. Since all the variants of Bloom filter structures ensure zero FNR, the performance is measured by their FPRs and corresponding memory usage.
5.1 Task1: Malicious URLs Detection
We explore using Bloom filters to identify malicious URLs. We used the URLs dataset downloaded from Kaggle, which includes 485,730 unique URLs. 16.47% of the URLs are malicious, and the others are benign. We randomly sampled 30% of the URLs (145,719 URLs) to train the malicious URL classification model. 17 lexical features are extracted from the URLs as classification features, such as "host name length", "path length", "length of top level domain", etc. We used "sklearn.ensemble.RandomForestClassifier" to train a random forest model (the Random Forest classifier consists of 10 decision trees, and each tree has at most 20 leaf nodes). After saving the model with "pickle", the model file costs 146Kb in total. The classifier's "predict_proba" method was used to give scores for queries.
We tested the optimal FPR for the four learned Bloom filter methods under total memory budgets from 200Kb to 500Kb (kilobits). Since the standard BF does not need a machine learning model, to make a fair comparison, the bitmap size of BF should also include the machine learning model size (146Kb in this experiment). Thus, the total bitmap size of BF is 346Kb to 646Kb. To implement the LBF, we tuned $\tau$ between $0$ and $1$ and picked the value giving the minimal FPR. The number of hash functions was determined by $K = \mathrm{Round}\left(\frac{R}{n_0} \log 2\right)$, where $n_0$ is the number of keys hashed into the Bloom filter given $\tau$. To implement the sandwiched LBF, we searched for the optimal $\tau$ and calculated the corresponding initial and backup filter sizes by the formula in Mitzenmacher (2018). When the optimal backup filter size is larger than the total bits budget, the sandwiched LBF does not need an initial filter and reduces to a standard LBF. For Ada-BF, we used the tuning strategy described in the previous section. $K_{\min}$ was set to $0$ by default. Thus, we only need to tune the combination of $(K_{\max}, c)$ that gives the optimal FPR. Similarly, for disjoint Ada-BF, we fixed $R_g = 0$ and searched for the optimal $(g, c)$.
Our trained machine learning model has a classification accuracy of 0.93. Considering that the non-informative majority-class classifier (which classifies every URL as benign) gives an accuracy of 0.84, our trained learner is not a strong classifier. However, the distribution of scores is desirable (Figure 2): as $s(x)$ increases, the empirical density of $s(x)$ decreases for non-keys and increases for keys. In our experiment, when the sandwiched LBF is optimized, the backup filter size always exceeds the total bitmap size. Thus, it reduces to the LBF and has the same FPR (as suggested by Figure 4(a)).
Our experiment shows that compared to the LBF and sandwiched LBF, both Ada-BF and disjoint Ada-BF achieve much lower FPRs. When filter size Kb, Ada-BF reduces the FPR by 81% compared to LBF or sandwiched LBF (disjoint Ada-BF reduces the FPR by 84%). Moreover, to achieve a FPR , Ada-BF and disjoint Ada-BF only require 200Kb, while both LBF and the sandwiched LBF need more than 350Kb. And to get a FPR , Ada-BF and disjoint Ada-BF reduce the memory usage from over 500Kb for LBF to 300Kb, which shows that our proposed algorithms save over 40% of the memory usage compared with LBF and sandwiched LBF.
5.2 Task 2: Virus Scan
Bloom filters are widely used to match a file's signature against a virus signature database. Our dataset includes the information of 41,323 benign files and 96,724 viral files. The virus files are collected from the VirusShare database (VirusShare, 2018). The dataset provides the MD5 signature of the files, their legitimate status and 53 other variables characterizing the file, like "Size of Code", "Major Link Version" and "Major Image Version". We trained a machine learning model with these variables to differentiate the benign files from the viral files. We randomly selected 20% of the samples as the training set to build a binary classification model using a Random Forest model (the Random Forest classifier consists of 15 decision trees, and each tree has at most 5 leaf nodes). We used "sklearn.ensemble.RandomForestClassifier" to tune the model, and the Random Forest classifier costs about 136Kb. The classification model achieves 0.98 prediction accuracy on the testing set. The predicted class probability (from the "predict_proba" method in the "sklearn" library) is used as the score $s(x)$. Other implementation details are similar to those in Task 1.
As the machine learning model achieves high prediction accuracy, Figure 4 suggests that all the learned Bloom filters show a huge advantage over the standard BF, with the FPR reduced by over 98%. Similar to the previous experiment, we observe consistently lower FPRs for our algorithms although the score distributions are not smooth or continuous (Figure 3). Again, our two methods show very similar performance. Compared with LBF, our methods reduce the FPRs by over 80%. To achieve a 0.2% FPR, the LBF and sandwiched LBF cost about 300Kb, while Ada-BF only needs 150Kb, which is equivalent to a 50% reduction in memory usage compared to the previous methods.
5.3 Sensitivity to Hyper-parameter Tuning
Compared with the LBF and sandwiched LBF, where we only need to search the space of $\tau$ to optimize the FPR, our algorithms require tuning a series of score thresholds. In the previous sections, we proposed a simple but useful tuning strategy in which the score thresholds are determined by only two hyper-parameters, $(g, c)$. Though our hyper-parameter tuning technique may lead to a sub-optimal choice, our experimental results show that we can still obtain a significantly lower FPR than the previous LBF. Moreover, if the number of groups $g$ is misspecified relative to the optimal choice, we can still achieve a very similar FPR compared with searching both $g$ and $c$. Figure 5 shows that for both Ada-BF and disjoint Ada-BF, tuning $c$ while fixing $g$ already achieves FPRs similar to the optimal case obtained by tuning both $(g, c)$, which suggests our algorithm does not require very accurate hyper-parameter tuning to achieve a significant reduction of the FPR.
5.4 Discussion: Sandwiched Learned Bloom filter versus Learned Bloom filter
Sandwiched LBF is a generalization of LBF and performs no worse than LBF. Although Mitzenmacher (2018) has shown how to allocate bits for the initial filter and backup filter to optimize the expected FPR, the result is based on a fixed FNR and FPR of the classifier. For many classifiers, however, FNR and FPR are functions of the prediction score threshold $\tau$. Figure 4(a) shows that the sandwiched LBF always has the same FPR as LBF even though we increase the bitmap size from 200Kb to 500Kb. This is because the sandwiched LBF is optimized when $\tau$ corresponds to a small FPR and a large FNR, where the optimal backup filter size even exceeds the total bitmap size. Hence, we should not allocate any bits to the initial filter, and the sandwiched LBF reduces to LBF. On the other hand, our second experiment suggests that as the bitmap size becomes larger, sparing more bits for the initial filter pays off, and the sandwiched LBF shows its advantage over the LBF (Figure 6(b)).
We have presented new approaches to implementing learned Bloom filters. We demonstrate analytically and empirically that our approaches significantly reduce the FPR and save memory compared with the previously proposed LBF and sandwiched LBF even when the learner’s discrimination power . We envision that our work will help and motivate integrating machine learning models into probabilistic algorithms in a more efficient way.
- Bloom (1970). Space/time trade-offs in hash coding with allowable errors. Communications of the ACM 13 (7), pp. 422–426.
- Broder et al. (2002). Network applications of Bloom filters: a survey. Internet Mathematics.
- Bruck et al. (2006). Weighted Bloom filter. In 2006 IEEE International Symposium on Information Theory, pp. 2304–2308.
- Carter et al. (1978). Exact and approximate membership testers. In Proceedings of the tenth annual ACM symposium on Theory of computing, pp. 59–65.
- Dillinger and Manolios (2004). Bloom filters in probabilistic verification. In Formal Methods in Computer-Aided Design, A. J. Hu and A. K. Martin (Eds.), Berlin, Heidelberg, pp. 370.
- Feinstein et al. (2003). Statistical approaches to DDoS attack detection and response. In Proceedings DARPA information survivability conference and exposition, Vol. 1, pp. 303–314.
- Hsu et al. (2019). Learning-based frequency estimation algorithms. In International Conference on Learning Representations.
- Kleinberg (2003). Bursty and hierarchical structure in streams. Data Mining and Knowledge Discovery 7 (4), pp. 373–397.
- Kraska et al. (2018). The case for learned index structures. In Proceedings of the 2018 International Conference on Management of Data, pp. 489–504.
- Mitzenmacher (2002). Compressed Bloom filters. IEEE/ACM Transactions on Networking (TON) 10 (5), pp. 604–612.
- Mitzenmacher (2018). A model for learned Bloom filters and optimizing by sandwiching. In Advances in Neural Information Processing Systems, pp. 464–473.
- Rae et al. (2019). Meta-learning neural Bloom filters. arXiv preprint arXiv:1906.04304.
- VirusShare (2018). https://virusshare.com/research.4n6
Appendix A Sensitivity to hyper-parameter tuning
Appendix B More comparisons between the LBF and sandwiched LBF
Appendix C Proof of the Statements
Proof of Lemma 1:
Let , then , and counts the number of non-keys falling in group and . To upper bound the probability of the overall estimation error of , first, we need to evaluate its expectation, .
Sinceis large, . Thus, we can approximate (if , ). Then, the expectation of overall error is approximated by , which goes to as becomes larger.
We need to further upper bound the tail probability of . First, we upper bound the variance of ,
Now, by invoking Chebyshev's inequality,
Thus, converges to 0 in probability as .
Proof of Theorem 1:
For comparison, we choose , for both LBF and Ada-BF, queries with scores larger than are identified as keys directly by the same machine learning model. Thus, to compare the overall FPR, we only need to evaluate the FPR of queries with score lower than .
Let be the probability of a key with score lower than . Let denote the number of keys with score less than , . For learned Bloom filter using hash functions, the expected FPR follows,
where is the length of the Bloom filter. For Ada-BF, assume we fix the number of groups . Then, we only need to determine and . Let The expected FPR of the Ada-BF is,
where . Next, we give a strategy to select which ensures a lower FPR of Ada-BF than LBF.
Proof of Theorem 2:
Let . By the tuning strategy described in the previous section, we require the expected false positive items should be similar across the groups. Thus, we have
where is the budget of buckets for group . For group , since all the queries are identified as keys by the machine learning model directly, thus, . Given length of Bloom filter for group 1, , the total budget of buckets can be expressed as,
Let and . Let denote the number of keys with score less than , , and be the number of keys in group , . Due to , we have . Moreover, since , queries with score higher than have the same FPR for both disjoint Ada-BF and LBF. So, we only need to compare the FPR of the two methods when the score is lower than . If LBF and Ada-BF achieve the same optimal expected FPR, we have
where is the budget of buckets of LBF. Let . Next, we upper bound with .
Therefore, we can lower bound ,
Now, we can lower bound ,
Since is a negative constant, while approaches to when is large. Therefore, when is large, and is strictly larger than . So, disjoint Ada-BF consumes less memory than LBF to achieve the same expected FPR.