No Repetition: Fast Streaming with Highly Concentrated Hashing

04/02/2020
by Anders Aamand, et al.

To get estimators that work within a certain error bound with high probability, a common strategy is to design one that works with constant probability, and then boost the probability using independent repetitions. Important examples of this approach are small-space algorithms for estimating the number of distinct elements in a stream, or estimating the set similarity between large sets. Using standard strongly universal hashing to process each element, we get a sketch-based estimator where the probability of too large an error is, say, 1/4. By performing r independent repetitions and taking the median of the estimators, the error probability falls exponentially in r. However, running r independent experiments increases the processing time by a factor of r. Here we make the point that if we have a hash function with strong concentration bounds, then we get the same high-probability bounds without any need for repetitions. Instead of r independent sketches, we have a single sketch that is r times bigger, so the total space is the same. However, we only apply a single hash function, so we save a factor of r in time, and the overall algorithms just get simpler. Fast practical hash functions with strong concentration bounds were recently proposed by Aamand et al. (to appear in STOC 2020). Using their hashing schemes, the algorithms thus become very fast and practical, suitable for online processing of high-volume data streams.
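
The contrast between the two strategies can be illustrated with a minimal sketch for distinct-element estimation. Everything here is an assumption made for illustration: the estimator is a generic k-minimum-values (bottom-k) sketch, the names `hash64`, `kmv_estimate`, `median_of_repetitions` and `single_big_sketch` are hypothetical, and the seeded BLAKE2b hash is only a stand-in, not the hashing scheme of Aamand et al. A cryptographic hash behaves well here but is slow, which is exactly the cost the paper's schemes are meant to avoid.

```python
import hashlib
import heapq
import statistics

HASH_RANGE = 2**64  # hash values are integers in [0, 2^64)


def hash64(x, seed):
    """Seeded 64-bit hash. A stand-in only, NOT the scheme of Aamand et al."""
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(repr(x).encode(), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "little")


def kmv_estimate(stream, k, seed=0):
    """Estimate the number of distinct elements from the k smallest hash values."""
    heap = []     # negated values, so -heap[0] is the largest value kept
    kept = set()  # the hash values currently stored in the heap
    for x in stream:
        h = hash64(x, seed)
        if h in kept:
            continue  # repeated element (or a hash collision)
        if len(heap) < k:
            heapq.heappush(heap, -h)
            kept.add(h)
        elif h < -heap[0]:
            evicted = -heapq.heappushpop(heap, -h)
            kept.discard(evicted)
            kept.add(h)
    if len(heap) < k:
        return float(len(heap))  # fewer than k distinct hashes: count is exact
    kth_smallest = -heap[0]
    return (k - 1) * HASH_RANGE / (kth_smallest + 1)


def median_of_repetitions(stream, k, r):
    """r independent size-k sketches: every element is hashed r times."""
    items = list(stream)
    estimates = [kmv_estimate(items, k, seed) for seed in range(r)]
    return statistics.median(estimates)


def single_big_sketch(stream, k, r):
    """One sketch of size r*k: same space, but every element is hashed once."""
    return kmv_estimate(stream, k * r, seed=0)
```

Both functions store r·k hash values in total. The difference is that `median_of_repetitions` evaluates r hash functions per stream element while `single_big_sketch` evaluates one; the abstract's claim is that, given a hash function with strong concentration bounds, the single large sketch attains the same high-probability error guarantee.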


