Exploiting Asymmetry for Synthetic Training Data Generation: SynthIE and the Case of Information Extraction

03/07/2023
by   Martin Josifoski, et al.

Large language models (LLMs) show great potential for synthetic data generation. This work shows that useful data can be synthetically generated even for tasks that the LLM cannot solve directly: for problems with structured outputs, an LLM can be prompted to perform the task in the opposite direction, generating plausible text for a given target structure. Leveraging this asymmetry in task difficulty makes it possible to produce large-scale, high-quality data for complex tasks. We demonstrate the effectiveness of this approach on closed information extraction, where collecting ground-truth data is challenging and no satisfactory dataset exists to date. We synthetically generate a dataset of 1.8M data points, demonstrate its superior quality compared to existing datasets in a human evaluation, and use it to finetune small models (220M and 770M parameters). The resulting models, SynthIE, outperform existing baselines of comparable size by a substantial margin of 57 and 79 absolute points in micro and macro F1, respectively. Code, data, and models are available at https://github.com/epfl-dlab/SynthIE.
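The reverse-direction idea can be illustrated with a minimal sketch: start from a target set of (subject, relation, object) triplets and ask a model to write text that expresses them, which yields a (text, triplets) training pair for the harder extraction direction. Everything below is an assumption made for illustration, not the authors' pipeline: the prompt wording, the names `build_prompt` and `synthesize_example`, and the stand-in `fake_llm` callable that replaces a real model call.

```python
from typing import Callable, Dict, List, Tuple

# A fact is a (subject, relation, object) triplet.
Triplet = Tuple[str, str, str]


def build_prompt(triplets: List[Triplet]) -> str:
    """Render the target triplet set as an instruction asking for matching text.

    The wording here is hypothetical; it is not the prompt used in the paper.
    """
    facts = "\n".join(f"({s}; {r}; {o})" for s, r, o in triplets)
    return (
        "Write a short, fluent passage that expresses exactly the following "
        "facts and nothing else.\n"
        f"Facts:\n{facts}\n"
        "Text:"
    )


def synthesize_example(
    triplets: List[Triplet], llm: Callable[[str], str]
) -> Dict[str, object]:
    """Generate text for a known structure and pair it with that structure.

    `llm` is any callable mapping a prompt string to a completion string.
    The returned pair can serve as (input, label) for an extraction model.
    """
    text = llm(build_prompt(triplets)).strip()
    return {"text": text, "triplets": list(triplets)}


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call, so the sketch runs offline."""
    return " Ada Lovelace was born in London. "


target = [("Ada Lovelace", "place of birth", "London")]
example = synthesize_example(target, fake_llm)
```

In this sketch the label is known before any text exists, so no extraction step is needed to annotate the data; the quality of the pair depends only on how faithfully the generated text expresses the given triplets.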
