LAGOS-AND: A Large, Gold Standard Dataset for Scholarly Author Name Disambiguation

04/05/2021
by Li Zhang, et al.

In this paper, we present a method to automatically generate a large-scale labeled dataset for author name disambiguation (AND) in the academic world by leveraging two authoritative sources, ORCID and DOI. Using the method, we built LAGOS-AND, a large, gold standard dataset for AND that differs substantially from existing ones. It contains 7.5M citations authored by 797K unique authors and closely resembles the entire Microsoft Academic Graph (MAG) across six gold standard validations. In building the dataset, we investigated the long-standing name synonym problem and, for the first time, measured the degree of variation in last names. Evidence from PubMed, MAG, and Semantic Scholar all suggests that about 7.5% of authors have last names that differ from the credible last names in the ORCID system when variants introduced by special characters are ignored. Furthermore, we provide a classification-based AND benchmark on the new dataset and release our model for disambiguation in general scenarios. If this work proves helpful for future studies, we believe it will challenge (1) the widely accepted block-based disambiguation framework used in production environments and (2) state-of-the-art methods and models for AND. The code, dataset, and pre-trained model are publicly available.
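To make the last-name variation measurement concrete, here is a minimal sketch of how such a rate could be computed once database author records have been paired with ORCID records. It is not the authors' procedure: the function names, the normalization rule (strip diacritics and non-alphanumeric characters, then lowercase), and the sample pairs are all assumptions made for illustration.

```python
import unicodedata
from typing import Iterable, Tuple


def normalize_last_name(name: str) -> str:
    """Hypothetical normalization: decompose accented letters, drop
    combining marks and punctuation, and lowercase what remains."""
    decomposed = unicodedata.normalize("NFKD", name)
    # Combining marks, hyphens, apostrophes, and spaces are not alphanumeric.
    return "".join(ch for ch in decomposed if ch.isalnum()).lower()


def last_name_varies(db_last: str, orcid_last: str) -> bool:
    """True if the names still differ after special characters are ignored."""
    return normalize_last_name(db_last) != normalize_last_name(orcid_last)


def variation_rate(pairs: Iterable[Tuple[str, str]]) -> float:
    """Share of (database last name, ORCID last name) pairs that vary."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return sum(last_name_varies(a, b) for a, b in pairs) / len(pairs)


# Made-up pairs, assumed to be already matched (e.g. via a shared DOI).
sample_pairs = [
    ("Müller", "Muller"),      # differs only by a diacritic
    ("O'Brien", "OBrien"),     # differs only by punctuation
    ("Smith", "Smith-Jones"),  # a real last-name variation
    ("Wang", "Wang"),          # identical
]

rate = variation_rate(sample_pairs)
```

Letters without a Unicode decomposition (for example "ø") pass through this normalization unchanged, so a real pipeline would likely need an explicit transliteration table on top of it.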

research
03/17/2023

Deep Author Name Disambiguation using DBLP Data

In the academic world, the number of scientists grows every year and so ...
research
07/09/2021

Bib2Auth: Deep Learning Approach for Author Disambiguation using Bibliographic Data

Author name ambiguity remains a critical open problem in digital librari...
research
11/01/2022

A Bayesian Learning, Greedy agglomerative clustering approach and evaluation techniques for Author Name Disambiguation Problem

Author names often suffer from ambiguity owing to the same author appear...
research
02/05/2021

ORCID-linked labeled data for evaluating author name disambiguation at scale

How can we evaluate the performance of a disambiguation method implement...
research
02/22/2018

Content-Based Citation Recommendation

We present a content-based method for recommending citations in an acade...
research
11/07/2016

Presenting a New Dataset for the Timeline Generation Problem

The timeline generation task summarises an entity's biography by selecti...
research
03/10/2020

Large-Scale Evaluation of Keyphrase Extraction Models

Keyphrase extraction models are usually evaluated under different, not d...
