Metadata Archaeology: Unearthing Data Subsets by Leveraging Training Dynamics

09/20/2022
by   Shoaib Ahmed Siddiqui, et al.

Modern machine learning research relies on relatively few carefully curated datasets. Even in these datasets, and typically in "untidy" or raw data, practitioners face significant issues of data quality and diversity that can be prohibitively labor-intensive to address. Existing methods for dealing with these challenges tend to make strong assumptions about the particular issues at play, and often require a priori knowledge or metadata such as domain labels. Our work is orthogonal to these methods: we instead focus on providing a unified and efficient framework for Metadata Archaeology – uncovering and inferring metadata of examples in a dataset. We curate different subsets of data that might exist in a dataset (e.g. mislabeled, atypical, or out-of-distribution examples) using simple transformations, and leverage differences in learning dynamics between these probe suites to infer metadata of interest. Our method is on par with far more sophisticated mitigation methods across different tasks: identifying and correcting mislabeled examples, classifying minority-group samples, prioritizing points relevant for training, and enabling scalable human auditing of relevant examples.
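
The idea of comparing learning dynamics against probe suites can be illustrated with a minimal sketch. It assumes that each example is summarized by its per-epoch training loss and that an unlabeled example inherits the metadata of its nearest probe trajectories. The abstract does not specify the classifier, so the k-nearest-neighbour vote, the function name `infer_metadata`, and the toy loss curves below are all illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def infer_metadata(probe_curves, probe_labels, query_curves, k=3):
    """Label each query loss curve by majority vote over its k nearest probe curves.

    Hypothetical helper: k-NN on loss trajectories is an assumption made
    for illustration, not a detail stated in the abstract.
    """
    probe_curves = np.asarray(probe_curves, dtype=float)
    query_curves = np.asarray(query_curves, dtype=float)
    probe_labels = np.asarray(probe_labels)

    # Pairwise Euclidean distances, shape (n_query, n_probe).
    diffs = query_curves[:, None, :] - probe_curves[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)

    # Indices of the k closest probe curves for each query.
    nearest = np.argsort(dists, axis=1)[:, :k]

    predictions = []
    for row in nearest:
        values, counts = np.unique(probe_labels[row], return_counts=True)
        predictions.append(values[np.argmax(counts)])
    return np.array(predictions)

# Toy per-epoch losses, invented for this example only.
probe_curves = [
    [2.0, 1.0, 0.4, 0.1],  # typical: loss drops quickly
    [2.1, 0.9, 0.3, 0.1],
    [1.9, 1.1, 0.5, 0.2],
    [2.3, 2.2, 2.0, 1.8],  # mislabeled: loss stays high
    [2.4, 2.1, 2.1, 1.9],
    [2.2, 2.3, 1.9, 1.7],
]
probe_labels = ["typical"] * 3 + ["mislabeled"] * 3

query_curves = [
    [2.0, 1.0, 0.5, 0.1],
    [2.3, 2.2, 2.0, 1.9],
]

predicted = infer_metadata(probe_curves, probe_labels, query_curves, k=3)
print(list(predicted))
```

In this sketch, examples whose loss stays high late in training land near the mislabeled probes, while quickly learned examples land near the typical ones. A real pipeline would record these curves during training instead of hard-coding them.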

research
06/01/2021

AMV : Algorithm Metadata Vocabulary

Metadata vocabularies are used in various domains of study. It provides ...
research
04/17/2022

A Psycho-linguistic Analysis of BitChute

In order to better support researchers, journalists, and practitioners in...
research
03/02/2018

Age Group Classification with Speech and Metadata Multimodality Fusion

Children comprise a significant proportion of TV viewers and it is worth...
research
07/28/2022

Adaptive Second Order Coresets for Data-efficient Machine Learning

Training machine learning models on massive datasets incurs substantial ...
research
03/30/2023

MetaEnhance: Metadata Quality Improvement for Electronic Theses and Dissertations of University Libraries

Metadata quality is crucial for digital objects to be discovered through...
research
05/10/2023

Finding Meaningful Distributions of ML Black-boxes under Forensic Investigation

Given a poorly documented neural network model, we take the perspective ...
