Daisy Bloom Filters

05/30/2022
by Ioana O. Bercea, et al.

Weighted Bloom filters (Bruck, Gao and Jiang, ISIT 2006) are Bloom filters that adapt the number of hash functions to the query element. That is, they use a sequence of hash functions h_1, h_2, … and insert x by setting the bits at the k_x positions h_1(x), h_2(x), …, h_{k_x}(x) to 1, where the parameter k_x depends on x. Similarly, a query for x checks whether the bits at positions h_1(x), h_2(x), …, h_{k_x}(x) contain a 0 (in which case we know that x was not inserted) or contain only 1s (in which case x may have been inserted, but the answer could also be a false positive). In this paper, we determine a near-optimal choice of the parameters k_x in a model where n elements are inserted independently from a probability distribution 𝒫 and query elements are chosen from a probability distribution 𝒬, under a bound on the false positive probability F. In contrast, the parameter choice of Bruck et al., as well as follow-up work by Wang et al., does not guarantee a nontrivial bound on the false positive rate. We refer to our parameterization of the weighted Bloom filter as a Daisy Bloom filter. For many distributions 𝒫 and 𝒬, the space usage of the Daisy Bloom filter is significantly smaller than that of standard Bloom filters. Our upper bound is complemented by an information-theoretic lower bound showing that, under mild restrictions on the distributions 𝒫 and 𝒬, the space usage of Daisy Bloom filters is the best possible up to a constant factor. Daisy Bloom filters can be seen as a fine-grained variant of a recent data structure of Vaidya, Knorr, Mitzenmacher and Kraska. Like their work, we are motivated by settings in which we have prior knowledge of the workload of the filter, possibly in the form of advice from a machine learning algorithm.
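The insert and query mechanics described in the abstract can be illustrated with a minimal Python sketch. This is not the paper's implementation: the class name WeightedBloomFilter, the parameter k_of (a caller-supplied function returning k_x), and the use of salted SHA-256 as a stand-in for the hash sequence h_1, h_2, … are all assumptions made for illustration. In particular, the sketch does not compute the near-optimal k_x values from 𝒫, 𝒬 and F; it only shows how a per-element k_x is used once it has been chosen.

```python
import hashlib


class WeightedBloomFilter:
    """Illustrative weighted Bloom filter: element x uses k_of(x) hash functions."""

    def __init__(self, m, k_of):
        self.m = m                # number of bits in the filter
        self.k_of = k_of          # hypothetical callback: x -> k_x (non-negative int)
        self.bits = bytearray(m)  # one byte per bit, chosen for readability

    def _positions(self, x):
        # Assumption: h_i(x) is simulated by hashing the pair (i, x) with SHA-256.
        for i in range(1, self.k_of(x) + 1):
            digest = hashlib.sha256(f"{i}:{x}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def insert(self, x):
        # Set the bits at h_1(x), ..., h_{k_x}(x) to 1.
        for p in self._positions(x):
            self.bits[p] = 1

    def query(self, x):
        # False: some bit is 0, so x was definitely not inserted.
        # True: all k_x bits are 1, so x may have been inserted (or is a false positive).
        # With k_x = 0 no bits are checked and the answer is always True.
        return all(self.bits[p] for p in self._positions(x))


# Usage: a made-up rule that spends more hash functions on keys marked "rare".
wbf = WeightedBloomFilter(1024, lambda x: 3 if x.startswith("rare") else 1)
wbf.insert("rare-1")
wbf.insert("common-1")
print(wbf.query("rare-1"), wbf.query("common-1"))
```

Because insert and query both derive their positions from the same k_of(x), an inserted element is never reported as absent; only the false positive rate depends on how the k_x values are chosen.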
