A Proposal to Study "Is High Quality Data All We Need?"

03/12/2022
by Swaroop Mishra, et al.

Even though deep neural models have achieved superhuman performance on many popular benchmarks, they fail to generalize to out-of-distribution (OOD) or adversarial datasets. Conventional approaches to increasing robustness include building ever-larger models and augmenting training with large-scale datasets. Orthogonal to these trends, we hypothesize that a smaller, high-quality dataset is what we need. Our hypothesis rests on the fact that deep neural networks are data-driven models, and it is the data that leads or misleads them. In this work, we propose an empirical study of how to select a subset of, and/or create, high-quality benchmark data from which a model can learn effectively. We seek to answer whether big datasets are truly needed to learn a task, and whether a smaller subset of high-quality data can replace them. We plan to investigate both data-pruning and data-creation paradigms for generating high-quality datasets.
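To make the data-pruning paradigm concrete, below is a minimal sketch of one common subset-selection heuristic: score each training example by a cheap proxy model's confidence in its true label and keep only the hardest examples. The proxy model, the confidence-based scoring rule, and the synthetic dataset here are illustrative assumptions, not the procedure proposed in the study.

```python
# Minimal sketch of confidence-based data pruning (illustrative only; the
# scoring rule, proxy model, and synthetic data are placeholder assumptions,
# not the method proposed in the paper).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a large benchmark dataset.
X, y = make_classification(n_samples=20000, n_features=40, n_informative=10,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=0)

# 1. Fit a cheap proxy model on the full training set.
proxy = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# 2. Score each training example by the proxy's confidence in the true label;
#    low-confidence examples are treated as the most informative ones.
true_class_prob = proxy.predict_proba(X_train)[np.arange(len(y_train)), y_train]

# 3. Keep only the hardest 25% of examples: the pruned, "high quality" subset.
keep = np.argsort(true_class_prob)[: len(y_train) // 4]
X_small, y_small = X_train[keep], y_train[keep]

# 4. Compare a model trained on the pruned subset against one trained on all data.
full_acc = LogisticRegression(max_iter=1000).fit(X_train, y_train).score(X_test, y_test)
small_acc = LogisticRegression(max_iter=1000).fit(X_small, y_small).score(X_test, y_test)
print(f"full data acc:   {full_acc:.3f}  ({len(y_train)} examples)")
print(f"pruned data acc: {small_acc:.3f}  ({len(y_small)} examples)")
```

The same scaffold also fits the data-creation paradigm: replace step 3 with a generation step that adds new examples in regions where the proxy is least confident, then compare against the full-data baseline in the same way.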


