Subset Sampling and Its Extensions

07/21/2023
by Jinchao Huang, et al.

This paper studies the subset sampling problem. The input is a set $\mathcal{S}$ of $n$ records together with a function $p$ that assigns each record $v \in \mathcal{S}$ a probability $p(v)$. A query returns a random subset $X$ of $\mathcal{S}$, where each record $v \in \mathcal{S}$ is sampled into $X$ independently with probability $p(v)$. The goal is to store $\mathcal{S}$ in a data structure to answer queries efficiently. If $\mathcal{S}$ fits in memory, the problem is interesting when $\mathcal{S}$ is dynamic. We develop a dynamic data structure with $\mathcal{O}(1 + \mu_{\mathcal{S}})$ expected query time, $\mathcal{O}(n)$ space and $\mathcal{O}(1)$ amortized expected update, insert and delete time, where $\mu_{\mathcal{S}} = \sum_{v \in \mathcal{S}} p(v)$. The query time and space are optimal. If $\mathcal{S}$ does not fit in memory, the problem is difficult even if $\mathcal{S}$ is static. Under this scenario, we present an I/O-efficient algorithm that answers a query in $\mathcal{O}\big((\log^*_B n)/B + (\mu_{\mathcal{S}}/B) \log_{M/B}(n/B)\big)$ amortized expected I/Os using $\mathcal{O}(n/B)$ space, where $M$ is the memory size, $B$ is the block size and $\log^*_B n$ is the number of iterative $\log_2(\cdot)$ operations we need to perform on $n$ before going below $B$. In addition, when each record is associated with a real-valued key, we extend the subset sampling problem to the range subset sampling problem, in which we require that the keys of the sampled records fall within a specified input range $[a,b]$. For this extension, we provide a solution under the dynamic setting, with $\mathcal{O}(\log n + \mu_{\mathcal{S} \cap [a,b]})$ expected query time, $\mathcal{O}(n)$ space and $\mathcal{O}(\log n)$ amortized expected update, insert and delete time.
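To make the problem statement concrete, the following Python sketch shows two simple baselines. It is not the data structure from the paper. `subset_sample_naive` flips one coin per record and so takes time proportional to n regardless of the expected output size. `subset_sample_uniform` handles only the special case where every record has the same probability `q` (an assumption made for this illustration), using the well-known trick of jumping between sampled positions with geometrically distributed gaps, which gives expected time proportional to 1 + nq. All function names and parameters here are hypothetical.

```python
import math
import random


def expected_size(records, p):
    """Expected output size: the sum of p(v) over all records v."""
    return sum(p(v) for v in records)


def subset_sample_naive(records, p, rng=random):
    """Baseline: one independent coin flip per record, O(n) time per query."""
    return [v for v in records if rng.random() < p(v)]


def subset_sample_uniform(records, q, rng=random):
    """Special case (assumed here): every record has the same probability q.

    Instead of flipping n coins, draw the number of skipped records before
    the next sampled one from a geometric distribution, so the work is
    proportional to the number of sampled records plus one.
    """
    if q <= 0.0:
        return []
    if q >= 1.0:
        return list(records)
    out = []
    log_fail = math.log1p(-q)      # log(1 - q), negative
    i = -1                         # index of the last sampled record
    while True:
        u = 1.0 - rng.random()     # uniform in (0, 1], avoids log(0)
        skipped = int(math.log(u) / log_fail)  # failures before next success
        i += skipped + 1
        if i >= len(records):
            return out
        out.append(records[i])
```

For example, `subset_sample_uniform(list(range(1000)), 0.1)` returns about 100 records on average, which matches `expected_size` for the constant function `p(v) = 0.1`. Handling arbitrary, changing probabilities within the bounds stated above requires the more involved structure that the paper develops.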


