Distinct Sampling on Streaming Data with Near-Duplicates

10/29/2018
by Jiecao Chen, et al.

In this paper we study how to perform distinct sampling in the streaming model when the data contain near-duplicates. The goal of distinct sampling is to return a distinct element chosen uniformly at random from the universe of elements, where all near-duplicates are treated as the same element. We also extend the result to the sliding-window setting, in which only the most recent items are of interest. We present algorithms with provable theoretical guarantees for datasets in Euclidean space and verify their effectiveness through an extensive set of experiments.
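
To make the problem statement concrete, the following is a minimal baseline sketch and not the paper's algorithm. It assumes a hypothetical distance threshold `alpha` and well-separated groups: any two near-duplicates lie within `alpha` of each other, and points from different groups lie farther apart than `alpha`. It stores one representative per group, so its memory grows linearly with the number of distinct elements. That cost is what a space-efficient streaming algorithm would aim to avoid. The function names are illustrative only.

```python
import math
import random


def group_representatives(stream, alpha):
    """Keep the first point seen from each near-duplicate group.

    Assumes (hypothetically) that groups are well separated: points in the
    same group are within `alpha`, points in different groups are not.
    """
    representatives = []
    for point in stream:
        # A point is new only if it is not near any stored representative.
        if not any(math.dist(point, rep) <= alpha for rep in representatives):
            representatives.append(point)
    return representatives


def distinct_sample(stream, alpha, rng=None):
    """Return one group representative chosen uniformly at random.

    Each group counts once, no matter how many near-duplicates it contains.
    Returns None for an empty stream.
    """
    rng = rng or random.Random()
    representatives = group_representatives(stream, alpha)
    if not representatives:
        return None
    return rng.choice(representatives)


if __name__ == "__main__":
    # Three groups; the first group appears three times in the stream.
    points = [(0.0, 0.0), (0.05, 0.0), (5.0, 5.0),
              (5.0, 5.05), (0.0, 0.02), (10.0, 0.0)]
    print(distinct_sample(points, alpha=0.1))
```

Sampling uniformly from the raw stream would favor the first group, which contributes three of the six points. Sampling from the representatives gives each of the three groups the same probability.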

research
06/10/2019

Parallel Streaming Random Sampling

This paper investigates parallel random sampling from a potentially-unen...
research
06/08/2023

Analysis of Knuth's Sampling Algorithm D and D'

In this research paper, we address the Distinct Elements estimation prob...
research
05/01/2018

Nearly Optimal Distinct Elements and Heavy Hitters on Sliding Windows

We study the distinct elements and ℓ_p-heavy hitters problems in the sli...
research
09/20/2023

Testing frequency distributions in a stream

We study how to verify specific frequency distributions when we observe ...
research
01/24/2023

Distinct Elements in Streams: An Algorithm for the (Text) Book

Given a data stream 𝒟 = ⟨ a_1, a_2, …, a_m ⟩ of m elements where each a_...
research
08/12/2020

On Uniformly Sampling Traces of a Transition System (Extended Version)

A key problem in constrained random verification (CRV) concerns generati...
research
03/11/2023

Generalizing Greenwald-Khanna Streaming Quantile Summaries for Weighted Inputs

Estimating quantiles, like the median or percentiles, is a fundamental t...
