Increasing Diversity While Maintaining Accuracy: Text Data Generation with Large Language Models and Human Interventions

06/07/2023
by John Joon Young Chung, et al.

Large language models (LLMs) can be used to generate text data for training and evaluating other models. However, creating high-quality datasets with LLMs can be challenging. In this work, we explore human-AI partnerships to facilitate high diversity and accuracy in LLM-based text data generation. We first examine two approaches to diversify text generation: 1) logit suppression, which discourages the generation of text that has already been generated frequently, and 2) temperature sampling, which flattens the token sampling probability distribution. We found that diversification approaches can increase data diversity, but often at the cost of data accuracy (i.e., text and labels being appropriate for the target domain). To address this issue, we examined two human interventions: 1) label replacement (LR), which corrects misaligned labels, and 2) out-of-scope filtering (OOSF), which removes instances that are outside the user's domain of interest or to which no considered label applies. With oracle studies, we found that LR increases the absolute accuracy of models trained with diversified datasets by 14.4%. Moreover, some models trained with data generated with LR interventions outperformed LLM-based few-shot classification. In contrast, OOSF was not effective in increasing model accuracy, implying the need for future work on human-in-the-loop text data generation.
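To make the two diversification knobs concrete, the sketch below applies them to raw next-token logits. This is an illustrative reconstruction, not the paper's code: the names (sample_token, suppression_strength) and the specific penalty form (subtracting a count-proportional term from each token's logit) are assumptions; in practice, logit suppression can be approximated with an API-level logit bias against frequently generated tokens, and temperature is a standard sampling parameter.

```python
import numpy as np
from collections import Counter

def sample_token(logits, temperature=1.0, token_counts=None, suppression_strength=0.0):
    """Sample one token index from raw logits.

    temperature > 1.0 flattens the sampling distribution (more diversity);
    suppression_strength > 0.0 penalizes tokens in proportion to how often
    they have already been generated (token_counts), an illustrative form
    of logit suppression.
    """
    adjusted = np.asarray(logits, dtype=float).copy()
    if token_counts and suppression_strength > 0.0:
        for token_id, count in token_counts.items():
            adjusted[token_id] -= suppression_strength * count
    adjusted /= max(temperature, 1e-8)          # flatten (or sharpen) logits
    probs = np.exp(adjusted - adjusted.max())   # numerically stable softmax
    probs /= probs.sum()
    return int(np.random.choice(len(probs), p=probs))

# Toy usage: sample 20 tokens from a fixed 5-token "vocabulary", feeding
# counts back in so frequently generated tokens get suppressed over time.
counts = Counter()
logits = [2.0, 1.5, 1.0, 0.5, 0.0]
for _ in range(20):
    tok = sample_token(logits, temperature=1.3, token_counts=counts,
                       suppression_strength=0.5)
    counts[tok] += 1
print(counts)
```

The two human interventions are simpler to state in code. The sketch below is likewise an assumption about the workflow, with hypothetical relabel and in_scope callbacks standing in for the human (or oracle) judgments described in the abstract:

```python
def apply_interventions(dataset, relabel, in_scope):
    """Clean generated (text, label) pairs with the two interventions.

    relabel(text, label) -> corrected label   (label replacement, LR)
    in_scope(text)       -> bool              (out-of-scope filtering, OOSF)
    """
    cleaned = []
    for text, label in dataset:
        if not in_scope(text):                 # OOSF: drop out-of-domain instances
            continue
        cleaned.append((text, relabel(text, label)))  # LR: fix misaligned labels
    return cleaned
```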

Related research

Improving Diversity of Neural Text Generation via Inverse Probability Weighting (03/13/2021)
The neural network based text generation suffers from the text degenerat...

On Training Instance Selection for Few-Shot Neural Text Generation (07/07/2021)
Large-scale pretrained language models have led to dramatic improvements...

The SelectGen Challenge: Finding the Best Training Samples for Few-Shot Neural Text Generation (08/14/2021)
We propose a shared task on training instance selection for few-shot neu...

Factuality Enhanced Language Models for Open-Ended Text Generation (06/09/2022)
Pretrained language models (LMs) are susceptible to generate text with n...

Neural Pipeline for Zero-Shot Data-to-Text Generation (03/30/2022)
In data-to-text (D2T) generation, training on in-domain data leads to ov...

Visualizing Linguistic Diversity of Text Datasets Synthesized by Large Language Models (05/19/2023)
Large language models (LLMs) can be used to generate smaller, more refin...

Unraveling ChatGPT: A Critical Analysis of AI-Generated Goal-Oriented Dialogues and Annotations (05/23/2023)
Large pre-trained language models have exhibited unprecedented capabilit...
