Towards Better Bounds for Finding Quasi-Identifiers

11/25/2022
by   Ryan Hildebrant, et al.
0

We revisit the problem of finding small ϵ-separation keys introduced by Motwani and Xu (2008). In this problem, the input is m-dimensional tuples x_1,x_2,…,x_n. The goal is to find a small subset of coordinates that separates at least (1-ϵ)n 2 pairs of tuples. They provided a fast algorithm that runs on Θ(m/ϵ) tuples sampled uniformly at random. We show that the sample size can be improved to Θ(m/√(ϵ)). Our algorithm also enjoys a faster running time. To obtain this result, we provide upper and lower bounds on the sample size to solve the following decision problem. Given a subset of coordinates A, reject if A separates fewer than (1-ϵ)n 2 pairs, and accept if A separates all pairs. The algorithm must be correct with probability at least 1-δ for all A. We show that for algorithms based on sampling: - Θ(m/√(ϵ)) samples are sufficient and necessary so that δ≤ e^-m and - Ω(√(log m/ϵ)) samples are necessary so that δ is a constant. Our analysis is based on a constrained version of the balls-into-bins problem. We believe our analysis may be of independent interest. We also study a related problem that asks for the following sketching algorithm: with given parameters α,k and ϵ, the algorithm takes a subset of coordinates A of size at most k and returns an estimate of the number of unseparated pairs in A up to a (1±ϵ) factor if it is at least αn 2. We show that even for constant α and success probability, such a sketching algorithm must use Ω(mk logϵ^-1) bits of space; on the other hand, uniform sampling yields a sketch of size Θ(mk log m/αϵ^2) for this purpose.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
04/29/2021

Algorithms for the Line-Constrained Disk Coverage and Related Problems

Given a set P of n points and a set S of m weighted disks in the plane, ...
research
08/15/2023

Improved Lower Bound for Estimating the Number of Defective Items

Let X be a set of items of size n that contains some defective items, de...
research
08/12/2020

Sampling Based Approximate Skyline Calculation on Big Data

The existing algorithms for processing skyline queries cannot adapt to b...
research
04/04/2023

Online Learning with Adversaries: A Differential Inclusion Analysis

We consider the measurement model Y = AX, where X and, hence, Y are rand...
research
07/05/2021

Memory-Sample Lower Bounds for Learning Parity with Noise

In this work, we show, for the well-studied problem of learning parity u...
research
11/04/2019

Faster Update Time for Turnstile Streaming Algorithms

In this paper, we present a new algorithm for maintaining linear sketche...
research
02/17/2020

Time-Space Tradeoffs for Distinguishing Distributions and Applications to Security of Goldreich's PRG

In this work, we establish lower-bounds against memory bounded algorithm...

Please sign up or login with your details

Forgot password? Click here to reset