Fix your Models by Fixing your Datasets

12/15/2021
by Atindriyo Sanyal, et al.

The quality of the underlying training data is crucial for building performant machine learning (ML) models that generalize well. However, current ML tools lack streamlined processes for improving data quality, so obtaining data quality insights and iteratively pruning errors to arrive at a dataset that is most representative of downstream use cases remains an ad hoc, manual process. Our work addresses this data tooling gap, which must be closed to build improved ML workflows purely through data-centric techniques. More specifically, we introduce a systematic framework for (1) finding noisy or mislabelled samples in the dataset and (2) identifying the most informative samples, which, when included in training, would provide the maximal lift in model performance. We demonstrate the efficacy of our framework on public datasets as well as private enterprise datasets from two Fortune 500 companies, and we are confident that this work will form the basis for ML teams to perform more intelligent data discovery and pruning.
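The abstract does not describe how the framework scores samples, so the following is only an illustrative sketch of the two tasks it names, using generic heuristics rather than the authors' method. It assumes a classifier that outputs per-class probabilities: samples whose given label receives a low predicted probability are flagged as possibly mislabelled, and unlabelled candidates are ranked by predictive entropy as a rough proxy for informativeness. The function names, the 0.2 threshold, and the toy probabilities are all hypothetical.

```python
import math


def flag_suspect_labels(probs, labels, threshold=0.2):
    """Return indices of samples whose given label gets low predicted probability.

    Generic confidence heuristic (an assumption, not the paper's method).
    """
    return [i for i, (p, y) in enumerate(zip(probs, labels)) if p[y] < threshold]


def rank_by_entropy(probs):
    """Return sample indices sorted from most to least uncertain.

    Uncertainty is measured by the entropy of the predicted distribution.
    """
    def entropy(p):
        return -sum(q * math.log(q) for q in p if q > 0)

    return sorted(range(len(probs)), key=lambda i: entropy(probs[i]), reverse=True)


# Hypothetical class probabilities for four labelled training samples.
train_probs = [[0.9, 0.1], [0.05, 0.95], [0.85, 0.15], [0.6, 0.4]]
train_labels = [0, 1, 1, 0]
suspects = flag_suspect_labels(train_probs, train_labels)

# Hypothetical class probabilities for three unlabelled candidate samples.
pool_probs = [[0.5, 0.5], [0.99, 0.01], [0.7, 0.3]]
ranking = rank_by_entropy(pool_probs)

print(suspects)  # [2]: sample 2 is labelled 1 but the model gives that class 0.15
print(ranking)   # [0, 2, 1]: the 50/50 prediction is the most uncertain
```

In practice the probabilities for the first step would typically come from held-out predictions, since a model tends to be overconfident on the samples it was trained on.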


research
12/02/2022

CLIP: Train Faster with Less Data

Deep learning models require an enormous amount of data for training. Ho...
research
10/25/2017

User-centric Composable Services: A New Generation of Personal Data Analytics

Machine Learning (ML) techniques, such as Neural Network, are widely use...
research
11/19/2021

Data Excellence for AI: Why Should You Care

The efficacy of machine learning (ML) models depends on both algorithms ...
research
09/20/2023

Dataset Factory: A Toolchain For Generative Computer Vision Datasets

Generative AI workflows heavily rely on data-centric tasks - such as fil...
research
09/29/2021

PyHard: a novel tool for generating hardness embeddings to support data-centric analysis

For building successful Machine Learning (ML) systems, it is imperative ...
research
11/03/2020

Ensuring Dataset Quality for Machine Learning Certification

In this paper, we address the problem of dataset quality in the context ...
research
12/01/2022

Explainable Artificial Intelligence for Improved Modeling of Processes

In modern business processes, the amount of data collected has increased...
