Dataset Factory: A Toolchain For Generative Computer Vision Datasets

09/20/2023
by Daniel Kharitonov, et al.

Generative AI workflows rely heavily on data-centric tasks, such as filtering samples by annotation fields, vector distances, or scores produced by custom classifiers. At the same time, computer vision datasets are quickly approaching petabyte volumes, which makes data wrangling difficult. In addition, the iterative nature of data preparation calls for robust dataset sharing and versioning mechanisms, both of which are hard to implement ad hoc. To address these challenges, we propose a "dataset factory" approach that separates the storage and processing of samples from metadata and enables data-centric operations at scale for machine learning teams and individual researchers.
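To make the idea of separating samples from metadata more concrete, the following is a minimal sketch, not the authors' implementation. It assumes a hypothetical metadata record (`SampleMeta`) that holds only a storage URI, an annotation label, an embedding vector, and a classifier score. The field names, the `select` helper, the thresholds, and the `s3://` paths are all illustrative assumptions. The three filters named in the abstract (annotation field, vector distance, classifier score) run over metadata alone, so no image is read from storage.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass(frozen=True)
class SampleMeta:
    """Hypothetical metadata row; the image itself stays in object storage."""
    uri: str          # pointer to the stored sample, not the pixels
    label: str        # annotation field
    embedding: tuple  # vector from some embedding model (assumed precomputed)
    score: float      # output of a custom classifier (assumed precomputed)


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def select(rows, label, query, max_distance, min_score):
    """Filter metadata rows only and return the URIs of matching samples."""
    return [
        r.uri
        for r in rows
        if r.label == label
        and cosine_distance(r.embedding, query) <= max_distance
        and r.score >= min_score
    ]


# Illustrative metadata table; the URIs are made up for this example.
rows = [
    SampleMeta("s3://bucket/img/0001.jpg", "cat", (1.0, 0.0), 0.92),
    SampleMeta("s3://bucket/img/0002.jpg", "cat", (0.0, 1.0), 0.95),
    SampleMeta("s3://bucket/img/0003.jpg", "dog", (1.0, 0.0), 0.99),
    SampleMeta("s3://bucket/img/0004.jpg", "cat", (0.9, 0.1), 0.40),
]

selected = select(rows, label="cat", query=(1.0, 0.0),
                  max_distance=0.2, min_score=0.5)
print(selected)
```

Because the selection returns URIs rather than image data, the resulting list could itself be stored and shared as a lightweight dataset version, which is one way to read the sharing and versioning requirement mentioned above.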


