On sampling from data with duplicate records

08/24/2020
by Alireza Heidari, et al.

Data deduplication is the task of detecting records in a database that correspond to the same real-world entity. Our goal is to develop a procedure that, in the presence of duplicates, samples uniformly from the set of entities in the database. We accomplish this with a two-stage process. In the first stage, we estimate the frequencies of all the entities in the database. In the second stage, we use rejection sampling to obtain an (approximately) uniform sample from the set of entities. However, efficiently estimating the frequencies of all the entities is a non-trivial task and is not attainable in the general case. Hence, we consider various natural properties of the data under which such frequency estimation (and consequently uniform sampling) is possible. Under each of these assumptions, we provide sampling algorithms and prove bounds on the statistical and computational complexity of our approach. We complement our study with extensive experiments on both real and synthetic datasets.
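The two-stage process described above can be illustrated with a minimal sketch. This is not the authors' algorithm: it assumes a hypothetical `entity_of` function that resolves each record to its entity exactly, so the frequencies are counted rather than estimated. The function names `estimate_frequencies` and `sample_entity` are likewise illustrative assumptions. The sketch only shows why rejection sampling with acceptance probability inversely proportional to an entity's frequency cancels the bias introduced by duplicates.

```python
import random
from collections import Counter


def estimate_frequencies(records, entity_of):
    """Stage 1 (simplified): count the records that map to each entity.

    Assumption: `entity_of` is a hypothetical exact resolver. In the
    setting of the abstract, frequencies can only be estimated.
    """
    return Counter(entity_of(r) for r in records)


def sample_entity(records, entity_of, freq, rng):
    """Stage 2: rejection sampling that returns one entity uniformly."""
    while True:
        record = rng.choice(records)  # uniform over records, not entities
        entity = entity_of(record)
        # An entity with k duplicate records is proposed k times as often,
        # so accepting with probability 1/k makes every entity equally likely.
        if rng.random() < 1.0 / freq[entity]:
            return entity


# Toy data: entity "a" has four duplicate records, "b" and "c" have one each.
records = ["a1", "a2", "a3", "a4", "b1", "c1"]
entity_of = lambda r: r[0]  # hypothetical resolver: first character
freq = estimate_frequencies(records, entity_of)
rng = random.Random(0)
draws = Counter(sample_entity(records, entity_of, freq, rng) for _ in range(6000))
```

Drawing records directly would return entity "a" about two thirds of the time; with the rejection step, the counts in `draws` should be roughly equal across the three entities. When the frequencies are only estimates, the resulting sample is only approximately uniform, which matches the hedge in the abstract.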


