Improving Tug-of-War sketch using Control-Variates method

03/04/2022
by Rameshwar Pratap, et al.

Computing space-efficient summaries of large data, known as sketches, is a central problem in streaming algorithms. Such sketches are used to answer post-hoc queries in several data analytics tasks. An algorithm for computing sketches is typically required to be fast, accurate, and space-efficient. A fundamental problem in the streaming framework is computing the frequency moments of a data stream. The frequency moments of a sequence containing f_i elements of type i, for i ∈ [n], are the numbers 𝐅_k = ∑_{i=1}^n f_i^k. 𝐅_k is the k-th power of the ℓ_k norm of the frequency vector (f_1, f_2, …, f_n). Another important problem is computing the similarity between two data streams via the inner product of their frequency vectors. The seminal work of Alon, Matias, and Szegedy <cit.>, known as the Tug-of-war (or AMS) sketch, gives a randomized sublinear-space (and linear-time) algorithm for estimating the frequency moments and the inner product between the frequency vectors of two data streams. However, the variance of these estimates tends to be large. In this work, we focus on minimizing the variance of these estimates. We use the classical Control-Variate method <cit.>, which is primarily known for variance reduction in Monte-Carlo simulations, and obtain significant variance reduction at the cost of a small computational overhead. We present a theoretical analysis of our proposal and complement it with supporting experiments on synthetic as well as real-world datasets.
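As a rough illustration of the baseline that the abstract builds on, the following is a minimal sketch of a Tug-of-war style estimator for 𝐅_2 and for the inner product of two frequency vectors. It does not implement the paper's control-variate correction, which the abstract does not specify. The function names (make_signs, sketch, estimate_f2, estimate_inner_product) are hypothetical, and the sketch assumes fully random ±1 signs stored in a table, whereas the AMS analysis only requires 4-wise independent hash functions.

```python
import random


def make_signs(n, num_counters, seed=0):
    """Draw a num_counters x n table of independent +/-1 signs.

    Assumption: the signs are fully random and stored explicitly. A
    space-efficient implementation would use 4-wise independent hashing.
    """
    rng = random.Random(seed)
    return [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(num_counters)]


def sketch(stream, signs):
    """One pass over the stream: counter j adds the sign of each arriving item."""
    counters = [0] * len(signs)
    for item in stream:
        for j, row in enumerate(signs):
            counters[j] += row[item]
    return counters


def estimate_f2(counters):
    """Average the squared counters; each square is an unbiased estimate of F_2."""
    return sum(c * c for c in counters) / len(counters)


def estimate_inner_product(counters_a, counters_b):
    """Average the counter products; both sketches must share the same signs."""
    return sum(a * b for a, b in zip(counters_a, counters_b)) / len(counters_a)


if __name__ == "__main__":
    n = 5  # item types are the integers 0 .. n-1
    signs = make_signs(n, num_counters=200, seed=1)
    stream_a = [0, 0, 0, 1, 1, 2, 4]
    stream_b = [0, 1, 1, 3, 3, 3]
    sk_a, sk_b = sketch(stream_a, signs), sketch(stream_b, signs)
    print("estimated F_2 of stream_a:", estimate_f2(sk_a))
    print("estimated inner product:", estimate_inner_product(sk_a, sk_b))
```

Averaging over more counters lowers the variance of both estimates at the cost of more space, which is the trade-off that a variance-reduction technique such as control variates aims to improve.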


research
02/05/2020

A Deterministic Streaming Sketch for Ridge Regression

We provide a deterministic space-efficient algorithm for estimating ridg...
research
10/10/2019

Efficient Sketching Algorithm for Sparse Binary Data

Recent advancement of the WWW, IOT, social network, e-commerce, etc. hav...
research
03/23/2018

Data Streams with Bounded Deletions

Two prevalent models in the data stream literature are the insertion-onl...
research
10/07/2020

New Verification Schemes for Frequency-Based Functions on Data Streams

We study the general problem of computing frequency-based functions, i.e...
research
08/20/2020

Simple and Efficient Cardinality Estimation in Data Streams

We study sketching schemes for the cardinality estimation problem in dat...
research
08/09/2022

Enabling Efficient and General Subpopulation Analytics in Multidimensional Data Streams

Today's large-scale services (e.g., video streaming platforms, data cent...
research
02/03/2021

CountSketches, Feature Hashing and the Median of Three

In this paper, we revisit the classic CountSketch method, which is a spa...
