Data Leakage in Notebooks: Static Detection and Better Processes

09/07/2022
by Chenyang Yang, et al.

Data science pipelines to train and evaluate models with machine learning may contain bugs just like any other code. Leakage between training and test data can lead to overestimating the model's accuracy during offline evaluations, possibly leading to deployment of low-quality models in production. Such leakage can happen easily by mistake or by following poor practices, but may be tedious and challenging to detect manually. We develop a static analysis approach to detect common forms of data leakage in data science code. Our evaluation shows that our analysis accurately detects data leakage and that such leakage is pervasive among over 100,000 analyzed public notebooks. We discuss how our static analysis approach can help both practitioners and educators, and how leakage prevention can be designed into the development process.
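To make the idea concrete, the sketch below shows one very crude way a static check could spot a common leakage pattern: fitting a preprocessor on the full dataset before it is split into training and test sets. This is not the analysis developed in the paper, whose details are not given in this abstract. It is a toy heuristic built on Python's standard `ast` module, and the helper names (`find_fit_before_split`, `FIT_METHODS`, `SPLIT_FUNCS`) are hypothetical. The notebook snippets are only parsed, never executed, so scikit-learn does not need to be installed.

```python
import ast

# Assumption: a toy heuristic with hypothetical names, not the paper's analysis.
FIT_METHODS = {"fit", "fit_transform"}
SPLIT_FUNCS = {"train_test_split"}


def _call_name(node: ast.Call) -> str:
    """Return the bare function or method name of a call node."""
    func = node.func
    if isinstance(func, ast.Attribute):
        return func.attr
    if isinstance(func, ast.Name):
        return func.id
    return ""


def find_fit_before_split(source: str) -> list[int]:
    """Return line numbers of fit-like calls that precede the first split call."""
    calls = [n for n in ast.walk(ast.parse(source)) if isinstance(n, ast.Call)]
    split_lines = [n.lineno for n in calls if _call_name(n) in SPLIT_FUNCS]
    if not split_lines:
        # No split in the code, so this heuristic has nothing to compare against.
        return []
    first_split = min(split_lines)
    return sorted(
        n.lineno
        for n in calls
        if _call_name(n) in FIT_METHODS and n.lineno < first_split
    )


# Leaky pattern: the scaler sees test rows because it is fit before the split.
LEAKY = """\
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y)
"""

# Safer pattern: split first, fit on training data only, then transform test data.
CLEAN = """\
X_train, X_test, y_train, y_test = train_test_split(X, y)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
"""

print(find_fit_before_split(LEAKY))  # flags line 2 of the leaky snippet
print(find_fit_before_split(CLEAN))  # flags nothing
```

A line-order check like this ignores data flow, so it would miss leakage that passes through helper functions or separate notebook cells, and it could flag a `fit` call on unrelated data. Gaps of that kind are the reason a more careful static analysis is needed.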

research
11/29/2022

Abstract Interpretation-Based Data Leakage Static Analysis

Data leakage is a well-known problem in machine learning. Data leakage o...
research
08/09/2022

STELLA: Sparse Taint Analysis for Enclave Leakage Detection

Intel SGX (Software Guard Extension) is a promising TEE (trusted executi...
research
09/10/2018

Is Leakage Power a Linear Function of Temperature?

In this work, we present a study of the leakage power modeling technique...
research
03/10/2022

TIDF-DLPM: Term and Inverse Document Frequency based Data Leakage Prevention Model

Confidentiality of the data is being endangered as it has been categoriz...
research
10/21/2020

On Offline Evaluation of Recommender Systems

In academic research, recommender models are often evaluated offline on ...
research
09/30/2020

Hidden Markov Models for Pipeline Damage Detection Using Piezoelectric Transducers

Oil and gas pipeline leakages lead to not only enormous economic loss bu...
research
04/04/2023

Ethylene Leak Detection Based on Infrared Imaging: A Benchmark

Ethylene leakage detection has become one of the most important research...
