An Alternative to Cells for Selective Execution of Data Science Pipelines

02/28/2023
by Lars Reimann, et al.

Data scientists often use notebooks to develop Data Science (DS) pipelines, particularly because notebooks allow parts of a pipeline to be executed selectively. However, notebooks for DS have many well-known flaws. We focus on the following ones in this paper: (1) Notebooks can become littered with code cells that are not part of the main DS pipeline but exist solely to make decisions (e.g. listing the columns of a tabular dataset). (2) While users are allowed to execute cells in any order, not every ordering is correct, because a cell can depend on declarations from other cells. (3) After a cell is changed, that cell and all cells that depend on the changed declarations must be rerun. (4) Changes to external values necessitate partial re-execution of the notebook. (5) Since cells are the smallest unit of execution, code that is unaffected by changes can inadvertently be re-executed. To solve these issues, we propose to replace cells as the basis for the selective execution of DS pipelines. Instead, we suggest populating a context menu for variables with actions that fit their type (such as listing columns if the variable is a tabular dataset). These actions are executed based on a data-flow analysis to ensure that dependencies between variables are respected and that results are updated properly after changes. Our solution separates pipeline code from decision-making code and automates dependency management, thus reducing clutter and the risk of errors.
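
To make the role of the data-flow analysis concrete, the following is a minimal sketch in Python, not the authors' implementation. It assumes a pipeline given as a list of single-line top-level statements, derives dependencies from the variables each statement reads and writes, and then answers two questions: which statements must run to produce a given variable, and which statements become stale after one statement is edited. All function names (`reads_and_writes`, `build_dependencies`, `required_for`, `stale_after_change`) and the example pipeline are hypothetical.

```python
import ast


def reads_and_writes(stmt_source):
    """Return (reads, writes): names loaded and names stored by one statement.

    Simplification: augmented assignments such as `x += 1` are treated as
    writes only, and attribute or subscript mutation is not tracked.
    """
    reads, writes = set(), set()
    for node in ast.walk(ast.parse(stmt_source)):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                writes.add(node.id)
            else:
                reads.add(node.id)
    return reads, writes


def build_dependencies(statements):
    """Map each statement index to the indices of the statements it reads from."""
    last_writer = {}  # variable name -> index of the latest statement writing it
    deps = {}
    for i, src in enumerate(statements):
        reads, writes = reads_and_writes(src)
        # Names never written in the pipeline (e.g. function names) are ignored.
        deps[i] = {last_writer[name] for name in reads if name in last_writer}
        for name in writes:
            last_writer[name] = i
    return deps


def required_for(target, deps):
    """Backward slice: statements needed to evaluate `target`, in program order."""
    needed, stack = set(), [target]
    while stack:
        i = stack.pop()
        if i not in needed:
            needed.add(i)
            stack.extend(deps[i])
    return sorted(needed)


def stale_after_change(changed, deps):
    """Forward slice: statements to rerun after statement `changed` is edited."""
    stale = {changed}
    # Dependencies only point to earlier statements, so one ordered pass suffices.
    for i in sorted(deps):
        if deps[i] & stale:
            stale.add(i)
    return sorted(stale)


# Hypothetical pipeline; the called functions are never executed here.
pipeline = [
    "raw = load('data.csv')",       # 0
    "clean = drop_missing(raw)",    # 1
    "labels = load('labels.csv')",  # 2
    "model = fit(clean, labels)",   # 3
    "report = summarize(raw)",      # 4
]
deps = build_dependencies(pipeline)
```

With this pipeline, `required_for(4, deps)` selects only statements 0 and 4, so an action on `report` does not trigger model fitting, and `stale_after_change(1, deps)` selects statements 1 and 3, leaving the unaffected `report` statement alone. A cell-based notebook cannot express either selection unless each statement sits in its own cell.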


