OpenDataVal: a Unified Benchmark for Data Valuation

06/18/2023
by Kevin Fu Jiang, et al.

Assessing the quality and impact of individual data points is critical for improving model performance and mitigating undesirable biases within the training dataset. Several data valuation algorithms have been proposed to quantify data quality; however, a systematic and standardized benchmarking system for data valuation is still lacking. In this paper, we introduce OpenDataVal, an easy-to-use and unified benchmark framework that empowers researchers and practitioners to apply and compare various data valuation algorithms. OpenDataVal provides an integrated environment that includes (i) a diverse collection of image, natural language, and tabular datasets, (ii) implementations of nine different state-of-the-art data valuation algorithms, and (iii) a prediction model API that can import any scikit-learn model. Furthermore, we propose four downstream machine learning tasks for evaluating the quality of data values. We perform a benchmarking analysis using OpenDataVal, quantifying and comparing the efficacy of state-of-the-art data valuation approaches. We find that no single algorithm performs uniformly best across all tasks, so the algorithm should be chosen to suit the user's downstream task. OpenDataVal is publicly available at https://opendataval.github.io with comprehensive documentation. We also provide a leaderboard where researchers can evaluate the effectiveness of their own data valuation algorithms.
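To make the idea of a "data value" concrete, here is a minimal sketch of leave-one-out valuation, a standard baseline in which a training point's value is the drop in validation accuracy when that point is removed. This is not OpenDataVal's API, and the abstract does not say whether leave-one-out is among the nine algorithms. The function names, the toy 1-D dataset, and the tiny nearest-centroid classifier (standing in for a scikit-learn model so the example needs only the standard library) are all assumptions made for illustration.

```python
from statistics import mean


def fit_centroids(xs, ys):
    """Fit a 1-D nearest-centroid classifier: one mean feature value per label."""
    return {label: mean(x for x, y in zip(xs, ys) if y == label) for label in set(ys)}


def accuracy(centroids, xs, ys):
    """Fraction of points whose nearest centroid carries the correct label."""
    preds = [min(centroids, key=lambda c: abs(x - centroids[c])) for x in xs]
    return sum(p == y for p, y in zip(preds, ys)) / len(ys)


def leave_one_out_values(train_x, train_y, val_x, val_y):
    """Value of point i = accuracy with all data minus accuracy without point i."""
    full = accuracy(fit_centroids(train_x, train_y), val_x, val_y)
    values = []
    for i in range(len(train_x)):
        xs = train_x[:i] + train_x[i + 1:]
        ys = train_y[:i] + train_y[i + 1:]
        values.append(full - accuracy(fit_centroids(xs, ys), val_x, val_y))
    return values


# Hypothetical toy data: class 0 lies near 0, class 1 near 10.
# The last training point (x=1.5) is deliberately mislabeled as class 1.
train_x = [0.0, 1.0, 2.0, 8.0, 9.0, 10.0, 1.5]
train_y = [0, 0, 0, 1, 1, 1, 1]
val_x = [0.5, 3.0, 4.5, 7.0, 9.5]
val_y = [0, 0, 0, 1, 1]

values = leave_one_out_values(train_x, train_y, val_x, val_y)
suspect = min(range(len(values)), key=values.__getitem__)  # lowest-valued point
print(values, suspect)
```

In this toy setup the mislabeled point is the only one with a negative value, because removing it raises validation accuracy. Using low values to flag suspicious points is shown here only as one plausible use of data values; the abstract does not name its four downstream tasks.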

