Is a prompt and a few samples all you need? Using GPT-4 for data augmentation in low-resource classification tasks

04/26/2023
by Anders Giovanni Møller, et al.

Obtaining and annotating data can be expensive and time-consuming, especially in complex, low-resource domains. We use GPT-4 and ChatGPT to augment small labeled datasets with synthetic data via simple prompts, on three classification tasks of varying complexity. For each task, we randomly select a base sample of 500 texts and generate 5,000 new synthetic samples. We explore two augmentation strategies: one that preserves the original label distribution and another that balances the distribution. Using progressively larger training samples, we train and evaluate a 110M-parameter multilingual language model on the real and synthetic data separately. We also test GPT-4 and ChatGPT in a zero-shot setting on the test sets. We observe that GPT-4 and ChatGPT have strong zero-shot performance across all tasks. We find that data augmented with synthetic samples yields good downstream performance, and particularly aids low-resource settings, such as identifying rare classes. Human-annotated data exhibits strong predictive power, overtaking synthetic data in two out of the three tasks. This finding highlights the need for more complex prompts if synthetic datasets are to consistently surpass human-generated ones.
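The two augmentation strategies described above can be sketched in a few lines: pick a target label for each synthetic sample (either mirroring the base sample's label distribution or drawing labels uniformly), then pair it with a seed text of that label in a generation prompt. This is a minimal illustration, not the authors' exact setup; the prompt template and function names here are assumptions, and the call to GPT-4 itself is omitted.

```python
import random

# Hypothetical prompt template for label-conditioned generation;
# the paper's actual prompts may differ.
PROMPT_TEMPLATE = (
    "Here is an example of a text labeled '{label}':\n"
    "{text}\n"
    "Write a new, different text that would receive the same label."
)


def seed_labels(base, n_new, strategy="preserve", rng=None):
    """Choose a label for each of n_new synthetic samples.

    strategy='preserve' mirrors the empirical label distribution of
    the base sample; strategy='balance' draws labels uniformly.
    """
    rng = rng or random.Random(0)
    labels = [lab for _, lab in base]
    if strategy == "preserve":
        return [rng.choice(labels) for _ in range(n_new)]
    uniq = sorted(set(labels))
    return [rng.choice(uniq) for _ in range(n_new)]


def build_prompts(base, n_new, strategy="preserve", rng=None):
    """Pair each chosen label with a seed text of that label and fill the template."""
    rng = rng or random.Random(0)
    by_label = {}
    for text, lab in base:
        by_label.setdefault(lab, []).append(text)
    prompts = []
    for lab in seed_labels(base, n_new, strategy, rng):
        text = rng.choice(by_label[lab])
        prompts.append(PROMPT_TEMPLATE.format(label=lab, text=text))
    return prompts
```

Each prompt would then be sent to the generative model, and the returned text kept as a synthetic sample with the chosen label. The "balance" strategy is what makes rare classes better represented in the augmented training set.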


