A Natural Language Processing Pipeline for Detecting Informal Data References in Academic Literature

05/23/2022
by Sara Lafia, et al.

Discovering authoritative links between publications and the datasets that they use can be a labor-intensive process. We introduce a natural language processing pipeline that retrieves and reviews publications for informal references to research datasets, which complements the work of data librarians. We first describe the components of the pipeline and then apply it to expand an authoritative bibliography linking thousands of social science studies to the data-related publications in which they are used. The pipeline increases recall for literature to review for inclusion in data-related collections of publications and makes it possible to detect informal data references at scale. We contribute (1) a novel Named Entity Recognition (NER) model that reliably detects informal data references and (2) a dataset connecting items from social science literature with datasets they reference. Together, these contributions enable future work on data reference, data citation networks, and data reuse.

research
10/11/2017

The number of linked references of publications in Microsoft Academic in comparison with the Web of Science

In the context of a comprehensive Microsoft Academic (MA) study, we expl...
research
03/10/2022

Librarian-in-the-Loop: A Natural Language Processing Paradigm for Detecting Informal Mentions of Research Data in Academic Literature

Data citations provide a foundation for studying research data impact. C...
research
02/16/2023

How and Why do Researchers Reference Data? A Study of Rhetorical Features and Functions of Data References in Academic Articles

Data reuse is a common practice in the social sciences. While published ...
research
07/13/2017

Incidental or influential? - Challenges in automatically detecting citation importance using publication full texts

This work looks in depth at several studies that have attempted to autom...
research
11/28/2020

Text Mining for Processing Interview Data in Computational Social Science

We use commercially available text analysis technology to process interv...
research
04/28/2023

A model for reference list length of scholarly articles

We introduce and analyse a simple probabilistic model of article product...
research
11/16/2020

In Search of Outstanding Research Advances: Prototyping the creation of an open dataset of "editorial highlights"

A long-standing research question in bibliometrics is how one identifies...
