A large dataset of software mentions in the biomedical literature

09/01/2022
by   Ana-Maria Istrate, et al.

We describe the CZ Software Mentions dataset, a new dataset of software mentions in biomedical papers. Plain-text software mentions are extracted with a trained SciBERT model from several sources: the NIH PubMed Central collection and papers provided by various publishers to the Chan Zuckerberg Initiative. The dataset provides sources, context and metadata, and, for a number of mentions, the disambiguated software entities and links. We extract 1.12 million unique string software mentions from 2.4 million papers in the NIH PMC-OA Commercial subset, 481k unique mentions from the NIH PMC-OA Non-Commercial subset (both gathered in October 2021) and 934k unique mentions from 3 million papers in the Publishers' collection. There is variation both in how software is mentioned in papers and in how the NER algorithm extracts it. We propose a clustering-based disambiguation algorithm to map plain-text software mentions into distinct software entities and apply it to the NIH PubMed Central Commercial collection. Through this methodology, we disambiguate 1.12 million unique strings extracted by the NER model into 97,600 unique software entities, covering 78 [...] repository, covering about 55 [...] detail the process of building the datasets, disambiguating and linking the software mentions, as well as opportunities and challenges that come with a dataset of this size. We make all data and code publicly available as a new resource to help assess the impact of software (in particular scientific open source projects) on science.
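To make the idea of mapping varied mention strings onto shared software entities concrete, here is a minimal sketch. It is not the authors' algorithm: it assumes a simple normalization step followed by pairwise string-similarity clustering using Python's standard `difflib`. The function names `normalize` and `cluster_mentions`, the example mentions, and the 0.85 threshold are all hypothetical choices made for illustration.

```python
import re
from difflib import SequenceMatcher


def normalize(mention: str) -> str:
    """Lowercase, drop standalone version tokens and punctuation, collapse spaces."""
    text = mention.lower()
    text = re.sub(r"\bv?\d+(\.\d+)*\b", " ", text)  # e.g. "v1.52", "25", "3.6.0"
    text = re.sub(r"[^a-z0-9]+", " ", text)
    text = " ".join(text.split())
    # Fall back to the raw lowercased string if normalization removed everything.
    return text or mention.lower().strip()


def cluster_mentions(mentions, threshold=0.85):
    """Group mentions whose normalized forms are similar (illustrative threshold)."""
    keys = [normalize(m) for m in mentions]
    parent = list(range(len(mentions)))  # union-find forest over mention indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Quadratic pairwise comparison: fine for a toy example, not for 1M strings.
    for i in range(len(keys)):
        for j in range(i + 1, len(keys)):
            if SequenceMatcher(None, keys[i], keys[j]).ratio() >= threshold:
                parent[find(j)] = find(i)

    clusters = {}
    for i, mention in enumerate(mentions):
        clusters.setdefault(find(i), []).append(mention)
    return list(clusters.values())


example = ["ImageJ", "imagej", "Image J", "ImageJ v1.52",
           "SPSS", "SPSS 25", "GraphPad Prism"]
clusters = cluster_mentions(example)
for group in clusters:
    print(group)
```

The pairwise loop is quadratic in the number of strings, so a dataset of the size described above would need blocking or an approximate nearest-neighbor step before any comparison of this kind.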
