Generating Faithful Synthetic Data with Large Language Models: A Case Study in Computational Social Science

05/24/2023
by Veniamin Veselovsky, et al.

Large Language Models (LLMs) have democratized synthetic data generation, which in turn has the potential to simplify and broaden a wide gamut of NLP tasks. Here, we tackle a pervasive problem in synthetic data generation: its generative distribution often differs from the distribution of real-world data researchers care about (in other words, it is unfaithful). In a case study on sarcasm detection, we study three strategies to increase the faithfulness of synthetic data: grounding, filtering, and taxonomy-based generation. We evaluate these strategies using the performance of classifiers trained with generated synthetic data on real-world data. While all three strategies improve the performance of classifiers, we find that grounding works best for the task at hand. As synthetic data generation plays an ever-increasing role in NLP research, we expect this work to be a stepping stone in improving its utility. We conclude this paper with some recommendations on how to generate high(er)-fidelity synthetic data for specific tasks.
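The evaluation protocol described above (train a classifier on synthetic data, then score it on real-world data) can be sketched as follows. This is a minimal illustration, not the authors' setup: the abstract does not name the classifier or the metric, so the bag-of-words Naive Bayes model, the F1 metric, and all function names here are assumptions chosen to keep the example self-contained.

```python
from collections import Counter
import math


def tokenize(text):
    # Whitespace tokenization on lowercased text (an assumption for brevity).
    return text.lower().split()


def train_naive_bayes(texts, labels):
    """Fit a multinomial Naive Bayes model from per-class word counts."""
    word_counts = {label: Counter() for label in set(labels)}
    class_counts = Counter(labels)
    for text, label in zip(texts, labels):
        word_counts[label].update(tokenize(text))
    vocab = set()
    for counts in word_counts.values():
        vocab.update(counts)
    return word_counts, class_counts, vocab


def predict(model, text):
    """Return the label with the highest add-one-smoothed log posterior."""
    word_counts, class_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        score = math.log(class_counts[label] / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in tokenize(text):
            if tok in vocab:  # ignore words never seen in training
                score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def f1_score(y_true, y_pred, positive=1):
    """F1 for the positive class (here assumed to mean 'sarcastic')."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def evaluate_synthetic_on_real(synth_texts, synth_labels, real_texts, real_labels):
    """Train only on synthetic examples, then score on held-out real examples."""
    model = train_naive_bayes(synth_texts, synth_labels)
    preds = [predict(model, t) for t in real_texts]
    return f1_score(real_labels, preds)
```

Under this sketch, comparing generation strategies amounts to calling `evaluate_synthetic_on_real` once per synthetic dataset while holding the real test set fixed, so that differences in the score reflect differences in the synthetic data rather than in the evaluation data.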

