A Recommender System for Scientific Datasets and Analysis Pipelines

08/20/2021
by Mandana Mazaheri, et al.
Scientific datasets and analysis pipelines are increasingly being shared publicly in the interest of open science. However, mechanisms are lacking to reliably identify which pipelines and datasets can appropriately be used together. Given the increasing number of high-quality public datasets and pipelines, this lack of clear compatibility threatens the findability and reusability of these resources. We investigate the feasibility of a collaborative filtering system to recommend pipelines and datasets based on provenance records from previous executions. We evaluate our system using datasets and pipelines extracted from the Canadian Open Neuroscience Platform, a national initiative for open neuroscience. The recommendations provided by our system (AUC=0.83) are significantly better than chance and outperform recommendations made by domain experts using their previous knowledge as well as pipeline and dataset descriptions (AUC=0.63). In particular, domain experts often neglect low-level technical aspects of a pipeline-dataset interaction, such as the level of pre-processing, which are captured by a provenance-based system. We conclude that provenance-based pipeline and dataset recommenders are feasible and beneficial to the sharing and usage of open-science resources. Future work will focus on the collection of more comprehensive provenance traces, and on deploying the system in production.
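
To make the idea of a provenance-based recommender concrete, the following is a minimal sketch of memory-based collaborative filtering over execution records, together with a pairwise AUC computation. It is a generic illustration and not the authors' implementation: the abstract does not specify the algorithm, so the record format, the names `RECORDS`, `build_profiles`, `cosine`, `score` and `auc`, and the toy data are all assumptions made for this example.

```python
from math import sqrt

# Hypothetical provenance records: (pipeline, dataset, outcome).
# outcome 1 = execution succeeded, 0 = execution failed. Toy data, not from the paper.
RECORDS = [
    ("pipeA", "ds1", 1), ("pipeA", "ds2", 1), ("pipeA", "ds3", 0),
    ("pipeB", "ds1", 1), ("pipeB", "ds2", 1),
    ("pipeC", "ds1", 0), ("pipeC", "ds3", 1),
]


def build_profiles(records):
    """Map each pipeline to {dataset: +1.0 or -1.0} from its recorded outcomes."""
    profiles = {}
    for pipeline, dataset, outcome in records:
        profiles.setdefault(pipeline, {})[dataset] = 1.0 if outcome else -1.0
    return profiles


def cosine(u, v):
    """Cosine similarity between two sparse outcome profiles."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[d] * v[d] for d in shared)
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v)


def score(profiles, pipeline, dataset):
    """Similarity-weighted vote of the other pipelines that were run on `dataset`.

    Returns a value in [-1, 1]; positive suggests the pair is likely compatible.
    """
    num = den = 0.0
    for other, prof in profiles.items():
        if other == pipeline or dataset not in prof:
            continue
        sim = cosine(profiles[pipeline], prof)
        num += sim * prof[dataset]
        den += abs(sim)
    return num / den if den else 0.0


def auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as 0.5.

    Assumes at least one positive and one negative label.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


profiles = build_profiles(RECORDS)
# pipeB has never been run on ds3; estimate compatibility from similar pipelines.
pipeB_ds3 = score(profiles, "pipeB", "ds3")
```

In this toy example, `pipeB` behaves like `pipeA` on the datasets they share, and `pipeA` failed on `ds3`, so the score for the unseen pair (`pipeB`, `ds3`) comes out negative. The `auc` helper shows how ranked recommendations could be compared against held-out outcomes, which is the kind of metric the abstract reports.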

