Vamsa: Tracking Provenance in Data Science Scripts

by Mohammad Hossein Namaki, et al.

Machine learning (ML), which was initially adopted for search ranking and recommendation systems, has firmly moved into the realm of core enterprise operations like sales optimization and preventative healthcare. For such ML applications, often deployed in regulated environments, the standards for user privacy, security, and data governance are substantially higher. This imposes the need for tracking provenance end-to-end, from the data sources used for training ML models to the predictions of the deployed models. In this work, we take a first step in this direction by introducing the ML provenance tracking problem in the context of data science scripts. The fundamental idea is to automatically identify the relationships between data and ML models and, in particular, to track which columns in a dataset have been used to derive the features of an ML model. We discuss the challenges in capturing such provenance information in the context of Python, the most common language used by data scientists. We then present Vamsa, a modular system that extracts provenance from Python scripts without requiring any changes to the user's code. Using up to 450K real-world data science scripts from Kaggle and publicly available Python notebooks, we verify the effectiveness of Vamsa in terms of coverage and performance. We also evaluate Vamsa's accuracy on a smaller subset of manually labeled data. Our analysis shows that Vamsa's precision and recall range from 87.5% [...] the order of milliseconds for scripts of average size.
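To make the column-tracking idea concrete, the following is a minimal sketch of static extraction using Python's standard `ast` module. It is not Vamsa's method, only a toy illustration under stated assumptions: the script text, the file name `heart.csv`, the column names, and the helper `columns_read` are all hypothetical, and the sketch only recognizes literal subscripts such as `df["c"]` or `df[["a", "b"]]` on a named variable (Python 3.9+ AST layout assumed). The analyzed script is parsed, never executed, so the user's code is left unchanged.

```python
import ast

# Hypothetical data science script; it is only parsed, never run.
SCRIPT = '''
import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("heart.csv")
X = df[["age", "chol"]]
y = df["target"]
model = LogisticRegression().fit(X, y)
'''


def columns_read(source, frame_name):
    """Return column names selected from `frame_name` via literal subscripts."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        # Keep only subscripts applied directly to the named variable.
        if not (isinstance(node, ast.Subscript)
                and isinstance(node.value, ast.Name)
                and node.value.id == frame_name):
            continue
        index = node.slice  # Python 3.9+: the index expression itself
        # df[["a", "b"]] yields a List node; df["a"] yields a single Constant.
        items = index.elts if isinstance(index, ast.List) else [index]
        for item in items:
            if isinstance(item, ast.Constant) and isinstance(item.value, str):
                found.add(item.value)
    return found


columns = columns_read(SCRIPT, "df")
print(sorted(columns))
```

This sketch does not distinguish feature columns from label columns and does not follow aliases, method calls such as `drop`, or computed column names, which are among the difficulties a real provenance tracker for Python has to handle.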




