Cluster Analysis of Open Research Data and a Case for Replication Metadata

05/26/2023
by Ana Trisovic, et al.

Research data are often released upon journal publication to enable result verification and reproducibility. For that reason, research dissemination infrastructures typically support diverse datasets from numerous disciplines, ranging from tabular data and program code to audio-visual files. Metadata, or data about data, is critical to making research outputs adequately documented and FAIR (findable, accessible, interoperable, and reusable). To contribute to the discussion on the development of metadata for research outputs, I conducted an exploratory analysis to determine how research datasets cluster based on what researchers organically deposit together. I use the content of over 40,000 datasets from the Harvard Dataverse research data repository as my sample for the cluster analysis. I find that the majority of the clusters are formed by single-type datasets, while in the rest of the sample no meaningful clusters can be identified. To interpret the results, I use the metadata standard employed by DataCite, a leading organization for documenting the scholarly record, and map its existing resource types to my results. About 65% of the sample can be described with single-type metadata (such as Dataset, Software, or Report), while the rest would require aggregate metadata types. Though DataCite supports an aggregate type such as Collection, I argue that a significant number of datasets, in particular those containing both data and code files (about 20%), would be more accurately described by a Replication resource metadata type. Such a resource type would be particularly useful in facilitating research reproducibility.
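
The mapping from deposited file types to a resource type can be illustrated with a minimal sketch. It is not the analysis used in the study: the extension-to-category table, the function names, and the rule that a document-only deposit maps to Report are all assumptions made for illustration. Only the resource type names (Dataset, Software, Report, Collection, and the proposed Replication) come from the text above.

```python
import os

# Hypothetical extension-to-category table (an assumption, not the study's own).
CATEGORY_BY_EXTENSION = {
    ".csv": "data", ".tab": "data", ".dta": "data",
    ".r": "code", ".py": "code", ".do": "code",
    ".pdf": "document", ".docx": "document",
}


def categorize(filename):
    """Return a coarse content category for one deposited file."""
    extension = os.path.splitext(filename)[1].lower()
    return CATEGORY_BY_EXTENSION.get(extension, "other")


def suggest_resource_type(filenames):
    """Suggest a resource type from the categories found in one dataset."""
    categories = {categorize(name) for name in filenames}
    # Data and code deposited together: the proposed Replication type.
    if {"data", "code"} <= categories:
        return "Replication"
    # Single-type datasets map to an existing single-type resource.
    if categories == {"data"}:
        return "Dataset"
    if categories == {"code"}:
        return "Software"
    if categories == {"document"}:
        return "Report"
    # Any other mix falls back to the generic aggregate type.
    return "Collection"


if __name__ == "__main__":
    print(suggest_resource_type(["survey.csv", "analysis.R"]))
    print(suggest_resource_type(["survey.csv", "codebook.pdf"]))
```

Under these assumptions, a deposit holding both a data file and a script is labeled Replication, while a mix that lacks either data or code stays a generic Collection.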
