In-stream Probabilistic Cardinality Estimation for Bloom Filters

10/27/2022
by Remy Scholler, et al.

The amount of data coming from sources such as IoT sensors, social networks, and cellular networks has increased exponentially over the last few years. Probabilistic Data Structures (PDS) are efficient alternatives to deterministic data structures and are well suited to large-scale data processing and streaming applications. They are mainly used for approximate membership queries, frequency counting, cardinality estimation, and similarity search. Finding the number of distinct elements in a large dataset or in streaming data is an active research area. In this work, we show that the usual Bloom-filter-based methods for this kind of cardinality estimation are relatively accurate on average but have a high variance. Reducing this variance is therefore desirable for obtaining accurate statistics. We propose a probabilistic approach that estimates the cardinality of a Bloom filter more accurately from its parameters, i.e., the number of hash functions k, the size m, and a counter s that is incremented whenever an element is not in the filter (i.e., when the membership query for this element is negative). Because a Bloom filter never returns a false negative, the counter can never exceed the exact cardinality, but hash collisions can produce false positives that leave the counter below it. This creates a counting error that we estimate accurately, in-stream, along with its standard deviation. We also discuss a way to optimize the parameters of a Bloom filter based on its counting error. We evaluate our approach with synthetic data created from an analysis of a real mobility dataset provided by a mobile network operator in the form of displacement matrices computed from mobile phone records. The proposed approach performs at least as well on average as state-of-the-art methods and has a much lower variance (about 6 to 7 times lower).
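
To make the role of the counter s concrete, here is a minimal Python sketch of a Bloom filter with m bits and k hash functions that increments s on every negative membership query. The class name `CountingBloom`, the SHA-256 double-hashing scheme, and the example stream are assumptions made for illustration and are not taken from the paper. The `fill_estimate` method implements the widely used estimator based on the number of set bits X, n ≈ -(m/k) ln(1 - X/m), shown only as a point of comparison; it is not the estimator proposed in this work, whose counting-error correction is not reproduced here.

```python
import hashlib
import math


class CountingBloom:
    """Bloom filter that also tracks a counter s of negative membership queries."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m                 # filter size (number of bits)
        self.k = k                 # number of hash functions
        self.bits = bytearray(m)   # one byte per bit, for readability
        self.s = 0                 # incremented when an element is not in the filter

    def _positions(self, item: str) -> list:
        # Assumed hashing scheme: double hashing over one SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: str) -> bool:
        """Insert item; return True if the membership query was negative."""
        positions = self._positions(item)
        is_new = not all(self.bits[p] for p in positions)
        if is_new:
            self.s += 1
            for p in positions:
                self.bits[p] = 1
        return is_new

    def fill_estimate(self) -> float:
        """Estimate cardinality from the number of set bits X: -(m/k) ln(1 - X/m)."""
        x = sum(self.bits)
        if x >= self.m:
            return float("inf")  # a saturated filter carries no information
        return -(self.m / self.k) * math.log(1 - x / self.m)


# Example stream: 500 distinct elements, each appearing 4 times.
bf = CountingBloom(m=4096, k=3)
for i in range(2000):
    bf.add(f"user-{i % 500}")

print("counter s:", bf.s)
print("fill-based estimate:", round(bf.fill_estimate(), 1))
```

In this sketch s can only fall short of the true cardinality (500 here): an element that was already inserted always tests positive, while a new element that collides on all k positions is silently skipped. That shortfall is the counting error the abstract refers to.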


research
05/19/2018

Do You Like What I Like? Similarity Estimation in Proximity-based Mobile Social Networks

While existing social networking services tend to connect people who kno...
research
03/08/2020

Multiset Synchronization with Counting Cuckoo Filters

Set synchronization is a fundamental task in distributed applications an...
research
02/19/2019

In oder Aus

Bloom filters are data structures used to determine set membership of el...
research
04/01/2022

Double-Hashing Algorithm for Frequency Estimation in Data Streams

Frequency estimation of elements is an important task for summarizing da...
research
10/15/2019

Privacy Preserving Count Statistics

The ability to preserve user privacy and anonymity is important. One of ...
research
08/13/2019

On Occupancy Moments and Bloom Filter Efficiency

Two multivariate committee distributions are shown to belong to Berg's f...
research
07/17/2020

Frequency Estimation in Data Streams: Learning the Optimal Hashing Scheme

We present a novel approach for the problem of frequency estimation in d...
