Fine-Grained Lineage for Safer Notebook Interactions

12/13/2020
by Stephen Macke, et al.

Computational notebooks have emerged as the platform of choice for data science and analytical workflows, enabling rapid iteration and exploration. By keeping intermediate program state in memory and segmenting units of execution into so-called "cells", notebooks allow users to execute their workflows interactively and enjoy particularly tight feedback. However, as cells are added, removed, reordered, and rerun, this hidden intermediate state accumulates in a way that is not necessarily correlated with the code visible in the notebook's cells, making execution behavior difficult to reason about, and leading to errors and lack of reproducibility. We present NBSafety, a custom Jupyter kernel that uses runtime tracing and static analysis to automatically manage lineage associated with cell execution and global notebook state. NBSafety detects and prevents errors that users make during unaided notebook interactions, all while preserving the flexibility of existing notebook semantics. We evaluate NBSafety's ability to prevent erroneous interactions by replaying and analyzing 666 real notebook sessions. Of these, NBSafety identified 117 sessions with potential safety errors, and in the remaining 549 sessions, the cells that NBSafety identified as resolving safety issues were more than 7× more likely to be selected by users for re-execution compared to a random baseline, even though the users were not using NBSafety and were therefore not influenced by its suggestions.
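The hidden-state problem described above can be illustrated with a small sketch. This is a toy model only, not NBSafety's implementation: the `ToyLineage` class, its `run_cell` and `is_stale` methods, and the variable names `df`, `agg`, and `report` are all hypothetical and introduced purely for illustration. It assumes each cell declares which symbols it defines and uses, and it treats a symbol as stale when any symbol it was computed from has been redefined more recently.

```python
from dataclasses import dataclass, field


@dataclass
class Symbol:
    name: str
    defined_at: int  # value of the execution counter at the last (re)definition
    parents: set = field(default_factory=set)  # names this symbol was computed from


class ToyLineage:
    """Hypothetical timestamp-based lineage tracker (illustration only)."""

    def __init__(self):
        self.counter = 0
        self.symbols = {}

    def run_cell(self, defines, uses=()):
        # Each cell execution bumps the counter and re-stamps what it defines.
        self.counter += 1
        for name in defines:
            self.symbols[name] = Symbol(name, self.counter, set(uses))

    def is_stale(self, name, _seen=None):
        # A symbol is stale if a parent was redefined after it, or if a
        # parent is itself stale. `_seen` guards against cyclic lineage.
        seen = set() if _seen is None else _seen
        if name in seen:
            return False
        seen.add(name)
        sym = self.symbols[name]
        for parent in sym.parents:
            if parent not in self.symbols:
                continue
            if self.symbols[parent].defined_at > sym.defined_at:
                return True
            if self.is_stale(parent, seen):
                return True
        return False


lineage = ToyLineage()
lineage.run_cell(defines=["df"])                    # cell A: df = load()
lineage.run_cell(defines=["agg"], uses=["df"])      # cell B: agg = df.sum()
lineage.run_cell(defines=["report"], uses=["agg"])  # cell C: report = fmt(agg)

lineage.run_cell(defines=["df"])                    # user edits and reruns cell A
stale_before = [n for n in ("agg", "report") if lineage.is_stale(n)]

lineage.run_cell(defines=["agg"], uses=["df"])      # rerunning cell B refreshes agg
agg_after = lineage.is_stale("agg")
report_after = lineage.is_stale("report")           # report still predates new agg
```

In this sketch, rerunning cell A leaves both `agg` and `report` out of date even though the visible code has not changed, and rerunning cell B resolves the issue for `agg` only. A real system would have to infer the defines/uses sets automatically, which the abstract attributes to runtime tracing and static analysis.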
