Sampling-based Estimation of the Number of Distinct Values in Distributed Environment

06/11/2022
by Jiajun Li, et al.

In data mining, estimating the number of distinct values (NDV) is a fundamental problem with various applications. Existing methods for estimating NDV fall broadly into two categories: i) scanning-based methods, which scan the entire dataset and maintain a sketch to approximate NDV; and ii) sampling-based methods, which estimate NDV from sampled data rather than accessing the entire data warehouse. Scanning-based methods achieve a lower approximation error at the cost of higher I/O and more time. Sampling-based estimation scales better, so it is preferable in applications with large data volumes where some approximation error is permissible. However, although sampling-based methods work well on a single machine, they are less practical in a distributed environment with massive data volumes. To compute the final NDV estimate, the entire sample must be transferred across the distributed system, which incurs a prohibitive communication cost when the sampling rate is high. This paper proposes a novel sketch-based distributed method that achieves sub-linear communication costs for distributed sampling-based NDV estimation under mild assumptions. Our method uses a sketch-based algorithm to estimate the sample's frequency of frequency (for each j, the number of distinct values that occur exactly j times in the sample) in the distributed streaming model, which makes it compatible with most classical sampling-based NDV estimators. Additionally, we provide theoretical evidence for our method's ability to minimize communication costs in the worst-case scenario. Extensive experiments show that our method saves orders of magnitude in communication costs compared to existing sampling- and sketch-based methods.
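
To make the frequency-of-frequency statistic concrete, the sketch below computes it exactly for a small in-memory sample and feeds it to one classical sampling-based estimator, GEE (the Guaranteed-Error Estimator of Charikar et al.). It is an illustration only, not the paper's sketch-based distributed algorithm: the function names, the toy sample, and the population size of 800 are all assumptions made for this example.

```python
from collections import Counter
import math


def frequency_of_frequencies(sample):
    """Return {j: f_j}, where f_j is the number of distinct values
    that occur exactly j times in the sample."""
    value_counts = Counter(sample)               # value -> occurrences in sample
    return dict(Counter(value_counts.values()))  # occurrences -> number of values


def gee_estimate(freq_of_freq, population_size, sample_size):
    """GEE: sqrt(N / n) * f_1 + sum of f_j for j >= 2.

    Values seen once are scaled up; values seen more than once are
    assumed to be frequent and are counted as-is.
    """
    scale = math.sqrt(population_size / sample_size)
    f1 = freq_of_freq.get(1, 0)
    repeated = sum(count for j, count in freq_of_freq.items() if j >= 2)
    return scale * f1 + repeated


# Hypothetical sample of 8 rows drawn from a table assumed to hold 800 rows.
sample = ["a", "b", "a", "c", "d", "d", "d", "e"]
fof = frequency_of_frequencies(sample)  # f_1 = 3, f_2 = 1, f_3 = 1
estimate = gee_estimate(fof, population_size=800, sample_size=len(sample))
print(fof, estimate)
```

Note that the estimator needs only the frequency-of-frequency vector, not the sampled values themselves. In a distributed setting the per-machine vectors cannot simply be added together, because the same value may be sampled on several machines, which is why a naive approach ends up shipping the whole sample.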


