Personal Names Popularity Estimation and its Application to Record Linkage

11/13/2018
by Ksenia Zhagorina, et al.
This study addresses a problem that is simple to state: how to estimate the number of people who bear the same full name in a large population. Name popularity estimates can support personal name matching in databases and are of interest in many other domains. A distinctive feature of large name collections is that they contain many unique items, which makes statistical modeling challenging. We investigate several statistical techniques and also propose a simple yet effective method for obtaining more accurate count estimates. In our experiments we use a dataset of about 20 million name occurrences that correspond to about 13 million real-world persons. We perform a thorough evaluation of the name count estimation methods and run a record linkage experiment guided by name popularity estimates. The results suggest that theoretically informed approaches outperform simple heuristics and can be useful in a variety of applications.
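
To make the estimation problem concrete, the following is a minimal sketch of one possible baseline. It is not the method proposed in the paper. It assumes that first and last names are chosen independently, so that the expected number of bearers of a full name is the population size times the product of the two marginal frequencies. The function name `estimate_full_name_count`, the toy sample, and the population size are hypothetical and exist only for illustration.

```python
from collections import Counter


def estimate_full_name_count(records, first, last, population):
    """Expected number of people named `first last` in a population.

    Assumption (for illustration only): first and last names are
    independent, so P(first, last) = P(first) * P(last).
    `records` is a list of (first_name, last_name) tuples.
    """
    n = len(records)
    if n == 0:
        return 0.0
    first_counts = Counter(f for f, _ in records)
    last_counts = Counter(l for _, l in records)
    # Counter returns 0 for unseen keys, so unseen names get probability 0.
    p_first = first_counts[first] / n
    p_last = last_counts[last] / n
    return population * p_first * p_last


# Hypothetical toy sample of (first name, last name) pairs.
sample = [
    ("anna", "ivanova"),
    ("anna", "petrova"),
    ("maria", "ivanova"),
    ("olga", "sidorova"),
]

# "maria petrova" never occurs in the sample as a full name,
# yet the baseline still yields a nonzero estimate for it.
estimate = estimate_full_name_count(sample, "maria", "petrova", population=1000)
unseen = estimate_full_name_count(sample, "zoya", "petrova", population=1000)
```

A baseline like this assigns a count of zero to any name whose first or last component is absent from the sample (as with `unseen` above), which illustrates why collections dominated by unique items are hard to model with plain frequency counts.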

research
01/22/2021

HANA: A HAndwritten NAme Database for Offline Handwritten Text Recognition

Methods for linking individuals across historical data sets, typically i...
research
06/26/2018

Record Linkage to Match Customer Names: A Probabilistic Approach

Consider the following problem: given a database of records indexed by n...
research
12/29/2017

Personal Names in Modern Turkey

We analyzed the most common 5000 male and 5000 female Turkish names base...
research
12/13/2016

Application of Advanced Record Linkage Techniques for Complex Population Reconstruction

Record linkage is the process of identifying records that refer to the s...
research
04/19/2021

Large Scale Record Linkage in the Presence of Missing Data

Record linkage is aimed at the accurate and efficient identification of ...
research
03/01/2020

Feature Engineering for Entity Resolution with Arabic Names: Improving Estimates of Observed Casualties in the Syrian Civil War

Entity resolution or record linkage is the task of identifying records r...
research
09/09/2017

Matrix and Graph Operations for Relationship Inference: An Illustration with the Kinship Inference in the China Biographical Database

Biographical databases contain diverse information about individuals. Pe...
