Double-Hashing Algorithm for Frequency Estimation in Data Streams

04/01/2022
by   Nikita Seleznev, et al.

Estimating element frequencies is an important task in summarizing data streams and in machine learning applications. The problem is often addressed with streaming algorithms built on sublinear-space data structures, which allow large volumes of data to be processed with limited storage. Commonly used streaming algorithms, such as count-min sketch, have many advantages but do not exploit the properties of a data stream to optimize performance. In this paper we introduce a novel double-hashing algorithm that provides the flexibility to optimize streaming algorithms according to the properties of a given stream. In the double-hashing approach, a standard streaming algorithm is first employed to obtain an estimate of the element frequencies. This estimate is derived from a fraction of the stream and allows the heavy hitters to be identified. Next, the algorithm constructs a modified hash table in which the heavy hitters are mapped into individual buckets and the other stream elements are mapped into the remaining buckets. Finally, the element frequencies are estimated over the entire data stream based on the constructed hash table, using any streaming algorithm. We demonstrate on both synthetic data and an internet query log dataset that our approach can improve frequency estimation by removing heavy hitters from the hashing process and thus reducing collisions in the hash table. Our approach avoids employing additional machine learning models to identify heavy hitters, which reduces algorithm complexity and streamlines implementation. Moreover, because it does not depend on specific features of the stream elements to identify heavy hitters, it is applicable to a large variety of streams. In addition, we propose a procedure for dynamically adjusting the double-hashing algorithm when the frequencies of the elements in a stream change over time.
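The three stages described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class and function names (`CountMinSketch`, `find_heavy_hitters`, `DoubleHashingEstimator`), the use of a count-min sketch in both stages, the 20% prefix, the choice of `k = 2` heavy hitters, and the synthetic stream are all assumptions made for illustration. For simplicity the candidate set in stage one is stored explicitly, which a real sublinear-space implementation would avoid.

```python
import hashlib
import random
from collections import Counter


class CountMinSketch:
    """Plain count-min sketch: `depth` rows of `width` counters."""

    def __init__(self, width, depth):
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # One salted hash per row, so rows behave like independent hash functions.
        digest = hashlib.blake2b(item.encode(), digest_size=8,
                                 salt=row.to_bytes(2, "little")).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, item):
        for r in range(self.depth):
            self.rows[r][self._index(item, r)] += 1

    def estimate(self, item):
        # Taking the minimum over rows never underestimates the true count.
        return min(self.rows[r][self._index(item, r)] for r in range(self.depth))


def find_heavy_hitters(prefix, k, width, depth):
    """Stage 1 (assumed form): sketch a fraction of the stream, keep the top k."""
    cms = CountMinSketch(width, depth)
    candidates = set()  # simplification: stored explicitly, not sublinear
    for item in prefix:
        cms.add(item)
        candidates.add(item)
    return set(sorted(candidates, key=cms.estimate, reverse=True)[:k])


class DoubleHashingEstimator:
    """Stages 2 and 3 (assumed form): heavy hitters get individual buckets,
    every other element is hashed into the remaining sketch."""

    def __init__(self, heavy_hitters, width, depth):
        self.exact = {h: 0 for h in heavy_hitters}
        self.rest = CountMinSketch(width, depth)

    def add(self, item):
        if item in self.exact:
            self.exact[item] += 1
        else:
            self.rest.add(item)

    def estimate(self, item):
        if item in self.exact:
            return self.exact[item]
        return self.rest.estimate(item)


# Hypothetical synthetic stream: two frequent elements plus a tail of rare ones.
stream = ["a"] * 500 + ["b"] * 300 + [f"x{i}" for i in range(200)]
random.Random(0).shuffle(stream)
truth = Counter(stream)

heavy = find_heavy_hitters(stream[: len(stream) // 5], k=2, width=64, depth=3)
est = DoubleHashingEstimator(heavy, width=64, depth=3)
for item in stream:  # final pass over the entire stream
    est.add(item)

print("heavy hitters:", sorted(heavy))
print("mean abs. error:", sum(est.estimate(x) - truth[x] for x in truth) / len(truth))
```

In this sketch, elements given individual buckets are counted exactly and no longer inflate the counters shared by the rest of the stream, which is the collision-reduction effect the abstract describes. The split between dedicated buckets and sketch width is a tunable assumption here, not a value taken from the paper.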

research
07/17/2020

Frequency Estimation in Data Streams: Learning the Optimal Hashing Scheme

We present a novel approach for the problem of frequency estimation in d...
research
02/07/2021

A Bayesian nonparametric approach to count-min sketch under power-law data streams

The count-min sketch (CMS) is a randomized data structure that provides ...
research
04/02/2020

No Repetition: Fast Streaming with Highly Concentrated Hashing

To get estimators that work within a certain error bound with high proba...
research
10/07/2020

New Verification Schemes for Frequency-Based Functions on Data Streams

We study the general problem of computing frequency-based functions, i.e...
research
08/21/2018

Composite Hashing for Data Stream Sketches

In rapid and massive data streams, it is often not possible to estimate ...
research
10/27/2022

In-stream Probabilistic Cardinality Estimation for Bloom Filters

The amount of data coming from different sources such as IoT-sensors, so...
research
03/02/2019

One Table to Count Them All: Parallel Frequency Estimation on Single-Board Computers

Sketches are probabilistic data structures that can provide approximate ...
