Caching and Reproducibility: Making Data Science experiments faster and FAIRer

11/08/2022
by Moritz Schubotz, et al.

Small to medium-scale data science experiments often rely on research software developed ad hoc by individual scientists or small teams. Often there is no time to make this research software fast, reusable, and open access. The consequence is twofold. First, subsequent researchers must invest significant work hours to build upon the proposed hypotheses or experimental framework; in the worst case, others cannot reproduce the experiment and reuse the findings for subsequent research. Second, if the ad-hoc research software fails during long-running, computationally expensive experiments, the overall effort to iteratively improve the software and rerun the experiments puts significant time pressure on the researchers. We suggest making caching an integral part of the research software development process, even before the first line of code is written. This article outlines caching recommendations for developing research software in data science projects. Our recommendations provide a perspective for circumventing common problems such as proprietary dependence and speed, while at the same time contributing to the reproducibility of experiments in the open science workflow. With respect to the four guiding principles, i.e., Findability, Accessibility, Interoperability, and Reusability (FAIR), we foresee that incorporating the proposed recommendations into research software development will make the data related to that software FAIRer for both machines and humans. We demonstrate the usefulness of some of the proposed recommendations in our recently completed research software project in mathematical information retrieval.
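
The core idea of the recommendations, persisting the results of expensive intermediate steps so that failed or repeated runs can resume instead of recomputing everything, can be sketched in a few lines of Python. The snippet below is a minimal, hypothetical illustration only, not the caching layer used in the paper's mathematical information retrieval project; the step name tokenize_formulas and the cache directory are assumptions made for the example.

    # Minimal disk-caching sketch (illustrative only, not the paper's implementation).
    import hashlib
    import json
    import pickle
    from pathlib import Path

    CACHE_DIR = Path("experiment_cache")
    CACHE_DIR.mkdir(exist_ok=True)

    def cached(step):
        """Disk-backed memoisation keyed on the step name and its arguments."""
        def wrapper(*args, **kwargs):
            key = hashlib.sha256(
                json.dumps([step.__name__, args, kwargs],
                           sort_keys=True, default=str).encode()
            ).hexdigest()
            path = CACHE_DIR / f"{key}.pkl"
            if path.exists():                        # reuse a previously computed result
                return pickle.loads(path.read_bytes())
            result = step(*args, **kwargs)           # compute once ...
            path.write_bytes(pickle.dumps(result))   # ... and persist it for later runs
            return result
        return wrapper

    @cached
    def tokenize_formulas(corpus_path: str) -> list:
        # Hypothetical long-running step, e.g. extracting formulas from a large corpus.
        with open(corpus_path, encoding="utf-8") as corpus:
            return [line.strip() for line in corpus]

Because the cached results are plain files on disk, they can in principle be archived and shared alongside the code, which is one way caching can support the reproducibility goals described in the abstract.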
