The Art and Practice of Data Science Pipelines: A Comprehensive Study of Data Science Pipelines In Theory, In-The-Small, and In-The-Large

12/02/2021
by Sumon Biswas, et al.

An increasing number of software systems today include data science components for descriptive, predictive, and prescriptive analytics. The collection of data science stages, from acquisition to cleaning/curation to modeling and so on, is referred to as a data science pipeline. To facilitate research and practice on data science pipelines, it is essential to understand their nature. What are the typical stages of a data science pipeline? How are they connected? Do pipelines differ between their theoretical representations and their use in practice? Today we do not fully understand these architectural characteristics of data science pipelines. In this work, we present a three-pronged comprehensive study to answer these questions for the state of the art, data science in-the-small, and data science in-the-large. Our study analyzes three datasets: a collection of 71 proposals for data science pipelines and related concepts in theory, a collection of over 105 implementations of curated data science pipelines from Kaggle competitions to understand data science in-the-small, and a collection of 21 mature data science projects from GitHub to understand data science in-the-large. Our study has led to three representations of data science pipelines that capture the essence of our subjects in theory, in-the-small, and in-the-large.

research
06/28/2020

Data Science: Nature and Pitfalls

Data science is creating very exciting trends as well as significant con...
research
11/25/2021

Federated Data Science to Break Down Silos [Vision]

Similar to Open Data initiatives, data science as a community has launch...
research
04/11/2023

Mining the Characteristics of Jupyter Notebooks in Data Science Projects

Nowadays, numerous industries have exceptional demand for skills in data...
research
03/03/2023

Linked Data Science Powered by Knowledge Graphs

In recent years, we have witnessed a growing interest in data science no...
research
10/21/2021

Viash: from scripts to pipelines

Most bioinformatics pipelines consist of software components that are ti...
research
12/19/2019

Data Science through the looking glass and what we found there

The recent success of machine learning (ML) has led to an explosive grow...
research
03/17/2022

Kan Extensions in Data Science and Machine Learning

A common problem in data science is "use this function defined over this...