Machine Learning Pipelines: Provenance, Reproducibility and FAIR Data Principles

06/22/2020
by Sheeba Samuel, et al.

Machine learning (ML) is an increasingly important scientific tool that supports decision making and knowledge generation in numerous fields. As its use grows, it becomes more important that the results of ML experiments are reproducible. Unfortunately, that is often not the case: ML, like many other disciplines, faces a reproducibility crisis. In this paper, we describe our goals and initial steps in supporting the end-to-end reproducibility of ML pipelines. We investigate which factors, beyond the availability of source code and datasets, influence the reproducibility of ML experiments. We propose ways to apply FAIR data practices to ML workflows. We present preliminary results on the role of our tool, ProvBook, in capturing and comparing the provenance of ML experiments and their reproducibility using Jupyter Notebooks.
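
To make the idea of capturing and comparing provenance concrete, the following is a minimal sketch in plain Python. It is not ProvBook's API or the authors' method: the function names (capture_provenance, run_experiment, diff_provenance), the record fields, and the toy "experiment" are all hypothetical, chosen only to illustrate that factors such as the random seed, the dataset checksum, and the interpreter version can be recorded alongside a result and compared across runs.

```python
import hashlib
import json
import platform
import random
import sys


def capture_provenance(seed, params, data_bytes):
    """Collect a small provenance record for one run (hypothetical schema)."""
    return {
        "seed": seed,
        "params": dict(params),
        # A checksum identifies the exact dataset contents that were used.
        "data_sha256": hashlib.sha256(data_bytes).hexdigest(),
        "python_version": platform.python_version(),
        "platform": sys.platform,
    }


def run_experiment(seed, params, data_bytes):
    """Toy stand-in for an ML run: a seeded pseudo-random 'score'."""
    rng = random.Random(seed)
    score = round(rng.random() * params["scale"], 6)
    record = capture_provenance(seed, params, data_bytes)
    record["result"] = score
    return record


def diff_provenance(a, b):
    """Return the fields whose values differ between two records."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


# Assumed toy dataset; in practice this would be the bytes of a real file.
data = b"feature,label\n1.0,0\n2.0,1\n"

run_a = run_experiment(42, {"scale": 1.0}, data)
run_b = run_experiment(42, {"scale": 1.0}, data)  # same seed, same data
run_c = run_experiment(7, {"scale": 1.0}, data)   # different seed

# Two runs with identical recorded inputs produce no differences.
print(json.dumps(diff_provenance(run_a, run_b)))
# A changed seed shows up in the diff next to the changed result.
print(sorted(diff_provenance(run_a, run_c)))
```

In this sketch, an empty diff between two records suggests the run was repeated under the same recorded conditions, while a non-empty diff points to the fields worth inspecting first. A real pipeline would likely record more, for example library versions and hardware details.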

research
07/19/2023

Reproducibility in Machine Learning-Driven Research

Research is facing a reproducibility crisis, in which the results and fi...
research
03/21/2023

Reasonable Scale Machine Learning with Open-Source Metaflow

As Machine Learning (ML) gains adoption across industries and new use ca...
research
03/16/2023

The NCI Imaging Data Commons as a platform for reproducible research in computational pathology

Objective: Reproducibility is critical for translating machine learning-...
research
09/30/2020

Workflow Provenance in the Lifecycle of Scientific Machine Learning

Machine Learning (ML) has already fundamentally changed several business...
research
01/13/2021

Whither AutoML? Understanding the Role of Automation in Machine Learning Workflows

Efforts to make machine learning more widely accessible have led to a ra...
research
09/02/2021

Quantifying Reproducibility in NLP and ML

Reproducibility has become an intensely debated topic in NLP and ML over...
research
02/09/2023

REIN: A Comprehensive Benchmark Framework for Data Cleaning Methods in ML Pipelines

Nowadays, machine learning (ML) plays a vital role in many aspects of ou...
