Scaling Author Name Disambiguation with CNF Blocking

09/27/2017
by Kunho Kim, et al.

An author name disambiguation (AND) algorithm identifies the unique author entity behind each of the similar or identical name records in scholarly or similar databases. Typically, a clustering method is used, which requires calculating the similarity between every possible pair of records. However, the total number of pairs grows quadratically with the size of the author database, making such clustering difficult for millions of records. One remedy is a blocking function that reduces the number of pairwise similarity calculations. Here, we introduce a new way of learning blocking schemes that uses a conjunctive normal form (CNF) rather than the disjunctive normal form (DNF). We demonstrate on PubMed author records that CNF blocking removes more pairs while preserving high pairs completeness compared to previous methods that use a DNF, and that it significantly reduces computation time. Thus, these concepts in scholarly data can be better represented with CNFs. Moreover, we show how to ensure that the method produces disjoint blocks so that the rest of the AND algorithm can be easily parallelized. Our CNF blocking tested on the entire PubMed database of 80 million author mentions efficiently removes 82.17
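To make the CNF versus DNF distinction concrete, here is a minimal Python sketch on toy data. Everything in it is an assumption for illustration: the six records, the predicate names (`same_last`, `same_first_initial`, `same_full_first`), and the two hand-written schemes are hypothetical, and no scheme is learned from labeled data as in the paper. It only shows how a blocking scheme filters candidate pairs and how reduction ratio and pairs completeness can be measured.

```python
from itertools import combinations

# Toy author mentions (hypothetical): (record id, first name, last name, true author id).
records = [
    (0, "kunho", "kim", "a1"),
    (1, "k", "kim", "a1"),
    (2, "kevin", "kim", "a2"),
    (3, "jane", "lee", "a3"),
    (4, "j", "lee", "a3"),
    (5, "kate", "lee", "a4"),
]

# Hypothetical blocking predicates: True when two records agree on a key.
def same_last(a, b):
    return a[2] == b[2]

def same_first_initial(a, b):
    return a[1][0] == b[1][0]

def same_full_first(a, b):
    return a[1] == b[1]

def cnf(clauses):
    """Conjunction (AND) of clauses, each clause a disjunction (OR) of predicates."""
    return lambda a, b: all(any(p(a, b) for p in clause) for clause in clauses)

def dnf(terms):
    """Disjunction (OR) of terms, each term a conjunction (AND) of predicates."""
    return lambda a, b: any(all(p(a, b) for p in term) for term in terms)

def evaluate(scheme):
    """Return (total pairs, kept pairs, reduction ratio, pairs completeness)."""
    pairs = list(combinations(records, 2))  # n * (n - 1) / 2 pairs: quadratic in n
    kept = [(a, b) for a, b in pairs if scheme(a, b)]
    true_pairs = [(a, b) for a, b in pairs if a[3] == b[3]]
    kept_true = [(a, b) for a, b in kept if a[3] == b[3]]
    reduction_ratio = 1 - len(kept) / len(pairs)
    pairs_completeness = len(kept_true) / len(true_pairs)
    return len(pairs), len(kept), reduction_ratio, pairs_completeness

# Hand-written schemes, chosen only to contrast the two normal forms.
cnf_scheme = cnf([[same_last], [same_first_initial, same_full_first]])
dnf_scheme = dnf([[same_last], [same_first_initial]])

cnf_result = evaluate(cnf_scheme)
dnf_result = evaluate(dnf_scheme)
print("CNF:", cnf_result)
print("DNF:", dnf_result)
```

On this toy data both schemes keep every true matching pair, but the CNF scheme passes fewer candidate pairs to the similarity step. That outcome depends entirely on the hand-picked predicates and records, so it should not be read as evidence for the paper's results.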


research
12/07/2019

AutoBlock: A Hands-off Blocking Framework for Entity Matching

Entity matching seeks to identify data records over one or multiple data...
research
08/19/2020

Scalable Blocking for Very Large Databases

In the field of database deduplication, the goal is to find approximatel...
research
05/28/2020

Efficient and Effective ER with Progressive Blocking

Blocking is a mechanism to improve the efficiency of Entity Resolution (...
research
05/18/2020

A Bayesian Multi-Layered Record Linkage Procedure to Analyze Functional Status of Medicare Patients with Traumatic Brain Injury

Understanding the association between injury severity and patients' pote...
research
02/05/2021

Generating automatically labeled data for author name disambiguation: An iterative clustering method

To train algorithms for supervised author name disambiguation, many stud...
research
02/05/2021

A fast and integrative algorithm for clustering performance evaluation in author name disambiguation

Author name disambiguation results are often evaluated by measures such ...
research
03/18/2020

A Dataset and an Approach for Identity Resolution of 38 Million Author IDs extracted from 2B Git Commits

The data collected from open source projects provide means to model larg...
