Unique Entity Estimation with Application to the Syrian Conflict

10/07/2017
by   Beidi Chen, et al.

Entity resolution identifies and removes duplicate entities in large, noisy databases, and both its use and its methodological development have grown with increased data availability. Nevertheless, entity resolution involves tradeoffs among assumptions about the data generation process, error rates, and computational scalability that make it difficult in real applications. In this paper, we focus on the related problem of unique entity estimation: estimating the number of unique entities, with associated standard errors, in a data set that contains duplicates. Unique entity estimation shares many of the fundamental challenges of entity resolution, chiefly that all-to-all entity comparisons are computationally intractable for large databases. To circumvent this barrier, we propose an efficient (near-linear time) estimation algorithm based on locality sensitive hashing. Under realistic assumptions, our estimator is unbiased and has provably low variance relative to existing random-sampling-based approaches. In addition, we show empirically that it outperforms state-of-the-art estimators on three real applications. Our work is motivated by the need for an accurate estimate of the documented, identifiable deaths in the ongoing Syrian conflict. Applied to the Syrian data set, our methodology yields an estimate of 191,874 ± 1,772 documented, identifiable deaths, which is very close to the Human Rights Data Analysis Group (HRDAG) estimate of 191,369. Our work illustrates the challenges and effort involved in solving a real, noisy, challenging problem in which modeling assumptions may not hold.
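To make the computational point concrete, the following is a minimal sketch of how locality sensitive hashing can avoid all-to-all comparisons. It is not the estimator proposed in the paper: it uses MinHash with banding to generate candidate pairs, verifies them with Jaccard similarity, and counts the resulting clusters with union-find, so it returns a plain cluster count with no standard error. All function names, the 3-character shingles, the band and row settings, the 0.5 threshold, and the toy records are assumptions made for illustration.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, k=3):
    """Set of overlapping character k-grams of a lowercased record string."""
    text = text.lower()
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def minhash_signature(tokens, num_hashes):
    """MinHash signature: the minimum seeded hash value for each seed."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{t}".encode(), digest_size=8).digest(),
                "big",
            )
            for t in tokens
        ))
    return signature


def candidate_pairs(records, bands=8, rows=2):
    """Index pairs (i, j), i < j, whose signatures agree on at least one band."""
    buckets = defaultdict(list)
    for idx, rec in enumerate(records):
        sig = minhash_signature(shingles(rec), bands * rows)
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets[key].append(idx)
    pairs = set()
    for members in buckets.values():
        pairs.update(combinations(members, 2))
    return pairs


def jaccard(a, b):
    """Jaccard similarity of two non-empty sets."""
    return len(a & b) / len(a | b)


def count_clusters(records, threshold=0.5):
    """Merge verified candidate pairs with union-find and count the clusters."""
    parent = list(range(len(records)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    sets = [shingles(r) for r in records]
    for i, j in candidate_pairs(records):
        if jaccard(sets[i], sets[j]) >= threshold:
            parent[find(i)] = find(j)
    return len({find(i) for i in range(len(records))})


# Hypothetical toy records: the first two are exact duplicates.
toy_records = [
    "john smith 1980 boston",
    "john smith 1980 boston",
    "maria garcia 1975 denver",
]
n_unique = count_clusters(toy_records)
```

Only records that share a bucket are compared, so the number of similarity computations depends on bucket sizes rather than on the square of the number of records. The choice of bands and rows trades missed duplicates against extra candidate pairs.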


research
10/11/2018

Probabilistic Blocking with An Application to the Syrian Conflict

Entity resolution seeks to merge databases as to remove duplicate entrie...
research
03/01/2020

Feature Engineering for Entity Resolution with Arabic Names: Improving Estimates of Observed Casualties in the Syrian Civil War

Entity resolution or record linkage is the task of identifying records r...
research
08/24/2020

On sampling from data with duplicate records

Data deduplication is the task of detecting records in a database that c...
research
08/10/2020

(Almost) All of Entity Resolution

Whether the goal is to estimate the number of people that live in a cong...
research
09/14/2015

A Practioner's Guide to Evaluating Entity Resolution Results

Entity resolution (ER) is the task of identifying records belonging to t...
research
01/07/2021

Controlling Entity Integrity with Key Sets

Codd's rule of entity integrity stipulates that every table has a primar...
research
10/03/2022

Estimating the Performance of Entity Resolution Algorithms: Lessons Learned Through PatentsView.org

This paper introduces a novel evaluation methodology for entity resoluti...
