Neural Relation Graph for Identifying Problematic Data

01/29/2023
by Jang-Hyun Kim, et al.

Diagnosing and cleaning datasets is crucial for building robust machine learning systems. However, identifying problems in large-scale datasets with real-world distributions is difficult because of complex issues such as label errors and under-representation of certain types. In this paper, we propose a novel approach for identifying problematic data that exploits a largely ignored source of information: the relational structure of data in the feature-embedded space. We develop an efficient algorithm for detecting label errors and outlier data points based on the relational graph structure of the dataset. We further introduce a visualization tool for contextualizing data points, which supports interactive dataset diagnosis. We evaluate label error and out-of-distribution detection performance on large-scale image and language tasks, including the ImageNet and GLUE benchmarks, and demonstrate the effectiveness of our approach for debugging datasets and building robust machine learning systems.
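
To make the idea of using relations between data points in feature space concrete, here is a minimal sketch. It is not the algorithm from the paper, which the abstract does not specify. It is a simple stand-in: a k-nearest-neighbour graph over cosine similarity, where a point scores high for label conflict when its similar neighbours carry different labels, and high as an outlier when it has no close neighbours. The function name `relation_scores`, the parameter `k`, and both scoring rules are assumptions made for illustration.

```python
import numpy as np


def relation_scores(features, labels, k=5):
    """Hypothetical helper: neighbour-based label-conflict and outlier scores.

    features: (n, d) array of embeddings; labels: (n,) array of class labels.
    Returns (conflict, outlier), each of shape (n,).
    """
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)

    # Cosine similarity between every pair of embeddings.
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.clip(norms, 1e-12, None)
    sim = unit @ unit.T
    np.fill_diagonal(sim, -np.inf)  # a point is not its own neighbour

    # Indices and similarities of the k most similar points (graph edges).
    nbrs = np.argsort(-sim, axis=1)[:, :k]
    nbr_sim = np.take_along_axis(sim, nbrs, axis=1)

    # Similarity-weighted fraction of neighbours whose label differs.
    weights = np.clip(nbr_sim, 0.0, None)
    disagree = labels[nbrs] != labels[:, None]
    conflict = (weights * disagree).sum(axis=1) / np.clip(
        weights.sum(axis=1), 1e-12, None
    )

    # Points with low average similarity to their neighbours look isolated.
    outlier = 1.0 - nbr_sim.mean(axis=1)
    return conflict, outlier
```

Calling `relation_scores(embeddings, labels, k=5)` on embeddings taken from a trained model and sorting by `conflict` gives a ranked list of candidate label errors to review by hand. The all-pairs similarity matrix makes this sketch quadratic in the number of points, so it only suits small samples.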


