Detecting Dataset Drift and Non-IID Sampling via k-Nearest Neighbors

05/25/2023
by Jesse Cummings, et al.

We present a straightforward statistical test to detect certain violations of the assumption that data are Independent and Identically Distributed (IID). The specific violation considered is common in real-world applications: the examples are ordered in the dataset such that examples close together in the ordering tend to have more similar feature values (e.g. due to distributional drift or attractive interactions between datapoints). Based on a k-Nearest Neighbors estimate, our approach can audit any multivariate numeric data, as well as other data types (image, text, audio, etc.) that can be represented numerically, for instance with model embeddings. Compared with existing methods for detecting drift or auto-correlation, our approach applies to more types of data and detects a wider variety of IID violations in practice. Code: https://github.com/cleanlab/cleanlab
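The idea in the abstract can be illustrated with a small permutation test. The sketch below is not the authors' implementation and not the cleanlab API. It assumes a simplified statistic (the mean gap in dataset position between each example and its k nearest neighbors in feature space), and the function names `nearest_neighbors`, `mean_index_gap`, and `ordering_pvalue` are hypothetical. If the ordering carries no information, randomly permuting the positions should produce gaps about as small as the observed one; a much smaller observed gap suggests that examples near each other in the ordering are also near each other in feature space.

```python
import random
from math import dist


def nearest_neighbors(X, k):
    """Brute-force k nearest neighbors (Euclidean distance) for every row of X."""
    n = len(X)
    neighbors = []
    for i in range(n):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: dist(X[i], X[j]))
        neighbors.append(others[:k])
    return neighbors


def mean_index_gap(neighbors, position):
    """Average |position[i] - position[j]| over all (example, neighbor) pairs."""
    gaps = [abs(position[i] - position[j])
            for i, js in enumerate(neighbors) for j in js]
    return sum(gaps) / len(gaps)


def ordering_pvalue(X, k=5, n_permutations=200, seed=0):
    """Permutation p-value for 'feature-space neighbors sit close in the ordering'.

    Assumption: mean index gap is used as the test statistic for simplicity;
    the abstract does not specify the statistic used in the paper.
    """
    rng = random.Random(seed)
    n = len(X)
    # Neighbor sets depend only on feature values, so compute them once.
    neighbors = nearest_neighbors(X, k)
    observed = mean_index_gap(neighbors, list(range(n)))
    hits = 0
    for _ in range(n_permutations):
        position = list(range(n))
        rng.shuffle(position)  # random ordering simulates the IID null
        if mean_index_gap(neighbors, position) <= observed:
            hits += 1
    # Add-one correction keeps the p-value strictly positive.
    return (hits + 1) / (n_permutations + 1)


# Toy data: a 1-D feature that drifts steadily with dataset position.
rng = random.Random(1)
drifting = [(i + rng.gauss(0, 0.1),) for i in range(60)]
shuffled = drifting[:]
rng.shuffle(shuffled)  # same values, ordering destroyed

print("drifting order p-value:", ordering_pvalue(drifting))
print("shuffled order p-value:", ordering_pvalue(shuffled))
```

For non-numeric data such as images or text, the rows of `X` would be embedding vectors from a model, as the abstract suggests. The brute-force neighbor search is quadratic in the number of examples, so a real audit would likely swap in an indexed or approximate nearest-neighbor search.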


