Composite Hashing for Data Stream Sketches

08/21/2018
by   Arijit Khan, et al.

In rapid and massive data streams, it is often not possible to estimate the frequency of items with complete accuracy. To perform the operation in a reasonable amount of space and with sufficiently low latency, approximate methods are used. The most common ones are variations of the Count-Min sketch. By using multiple hash functions, they summarize massive streams in sub-linear space. In practice, data item ids or keys can be modular: a graph edge is represented by source and target node ids, a 32-bit IP address is composed of four 8-bit words, and a web address consists of domain name, domain extension, path, and filename, among many other examples. In this paper, we investigate the modularity property of item keys and systematically develop more accurate, composite hashing strategies, such as employing multiple independent hash functions that hash different modules in a key and their combinations separately, instead of hashing the entire key directly into the sketch. However, finding the best hashing strategy is non-trivial, since there is an exponential number of ways to combine the modules of a key before they can be hashed into the sketch. Moreover, given a fixed size allocated for the entire sketch, it is hard to find the optimal range of all hash functions that correspond to different modules and their combinations. We solve both of these problems with extensive theoretical analysis, and perform thorough experiments with real-world datasets to demonstrate the accuracy and efficiency of our proposed method, MOD-Sketch.


research
08/27/2023

Locally Uniform Hashing

Hashing is a common technique used in data processing, with a strong imp...
research
04/02/2020

No Repetition: Fast Streaming with Highly Concentrated Hashing

To get estimators that work within a certain error bound with high proba...
research
04/01/2022

Double-Hashing Algorithm for Frequency Estimation in Data Streams

Frequency estimation of elements is an important task for summarizing da...
research
01/07/2022

GCWSNet: Generalized Consistent Weighted Sampling for Scalable and Accurate Training of Neural Networks

We develop the "generalized consistent weighted sampling" (GCWS) for has...
research
08/19/2020

The Power of Hashing with Mersenne Primes

The classic way of computing a k-universal hash function is to use a ran...
research
01/03/2019

A Fast Sketch Method for Mining User Similarities over Fully Dynamic Graph Streams

Many real-world networks such as Twitter and YouTube are given as fully ...
research
11/07/2017

Finding Heavily-Weighted Features in Data Streams

We introduce a new sub-linear space data structure---the Weight-Median S...
