A Natural Language Processing Pipeline for Detecting Informal Data References in Academic Literature

by Sara Lafia, et al.

Discovering authoritative links between publications and the datasets that they use can be a labor-intensive process. We introduce a natural language processing pipeline that retrieves and reviews publications for informal references to research datasets, which complements the work of data librarians. We first describe the components of the pipeline and then apply it to expand an authoritative bibliography linking thousands of social science studies to the data-related publications in which they are used. The pipeline increases recall when gathering candidate literature to review for inclusion in data-related collections of publications, and it makes it possible to detect informal data references at scale. We contribute (1) a novel Named Entity Recognition (NER) model that reliably detects informal data references and (2) a dataset connecting items from social science literature with the datasets they reference. Together, these contributions enable future work on data reference, data citation networks, and data reuse.




The number of linked references of publications in Microsoft Academic in comparison with the Web of Science

In the context of a comprehensive Microsoft Academic (MA) study, we expl...

Librarian-in-the-Loop: A Natural Language Processing Paradigm for Detecting Informal Mentions of Research Data in Academic Literature

Data citations provide a foundation for studying research data impact. C...

Incidental or influential? - Challenges in automatically detecting citation importance using publication full texts

This work looks in depth at several studies that have attempted to autom...

Text Mining for Processing Interview Data in Computational Social Science

We use commercially available text analysis technology to process interv...

In Search of Outstanding Research Advances: Prototyping the creation of an open dataset of "editorial highlights"

A long-standing research question in bibliometrics is how one identifies...

Citation Data of Czech Apex Courts

In this paper, we introduce the citation data of the Czech apex courts (...

Building astroBERT, a language model for Astronomy & Astrophysics

The existing search tools for exploring the NASA Astrophysics Data Syste...