Abstract Interpretation-Based Data Leakage Static Analysis

11/29/2022
by Filip Drobnjaković, et al.

Data leakage is a well-known problem in machine learning. It occurs when information from outside the training dataset is used to create a model. This renders a model overly optimistic or even useless in the real world, since the model tends to rely heavily on the unfairly acquired information. To date, data leakage is detected post-mortem using run-time methods. However, due to the insidious nature of data leakage, it may not be apparent to a data scientist that leakage has occurred in the first place. For this reason, it is advantageous to detect data leakage as early as possible in the development life cycle. In this paper, we propose a novel static analysis to detect several instances of data leakage at development time. We define our analysis using the framework of abstract interpretation: we define a concrete semantics that is sound and complete, from which we derive a sound and computable abstract semantics. We implement our static analysis inside the open-source NBLyzer static analysis framework and demonstrate its utility by evaluating its performance and precision on over 2000 Kaggle competition notebooks.
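
To make the definition concrete, here is a minimal sketch of one common form of leakage: normalization statistics computed before the train/test split. It is an illustration only. The data, the split, and the helper `standardize` are hypothetical, and the abstract does not say which instances of leakage the analysis targets.

```python
from statistics import mean, pstdev


def standardize(values, mu, sigma):
    """Scale values using the given mean and standard deviation."""
    return [(v - mu) / sigma for v in values]


# Hypothetical feature column; the last row is held out for testing.
data = [1.0, 2.0, 3.0, 4.0, 100.0]
train, test = data[:4], data[4:]

# Leaky: the statistics are computed over all rows, so the test row
# influences how the training data is scaled.
leaky_mu, leaky_sigma = mean(data), pstdev(data)
leaky_train = standardize(train, leaky_mu, leaky_sigma)

# Leak-free: the statistics come from the training rows only.
clean_mu, clean_sigma = mean(train), pstdev(train)
clean_train = standardize(train, clean_mu, clean_sigma)

print("mean used (leaky):", leaky_mu)
print("mean used (clean):", clean_mu)
```

The two training sets differ only because the held-out row shifted the statistics. Nothing fails at run time, which is why this kind of mistake can go unnoticed.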


research
09/07/2022

Data Leakage in Notebooks: Static Detection and Better Processes

Data science pipelines to train and evaluate models with machine learnin...
research
11/20/2017

Abstract Interpretation of Binary Code with Memory Accesses using Polyhedra

In this paper we propose a novel methodology for static analysis of bina...
research
04/27/2018

Sound up-to techniques and Complete abstract domains

Abstract interpretation is a method to automatically find invariants of ...
research
11/30/2018

Thinging Machine applied to Information Leakage

This paper introduces a case study that involves data leakage in a bank ...
research
08/09/2022

STELLA: Sparse Taint Analysis for Enclave Leakage Detection

Intel SGX (Software Guard Extension) is a promising TEE (trusted executi...
research
11/17/2022

Completeness in static analysis by abstract interpretation, a personal point of view

Static analysis by abstract interpretation is generally designed to be ”...
research
03/10/2022

TIDF-DLPM: Term and Inverse Document Frequency based Data Leakage Prevention Model

Confidentiality of the data is being endangered as it has been categoriz...
