ParaAMR: A Large-Scale Syntactically Diverse Paraphrase Dataset by AMR Back-Translation

05/26/2023
by Kuan-Hao Huang, et al.

Paraphrase generation is a long-standing task in natural language processing (NLP). Supervised paraphrase generation models, which rely on human-annotated paraphrase pairs, are costly and hard to scale up. On the other hand, automatically annotated paraphrase pairs (e.g., those produced by machine back-translation) usually lack syntactic diversity: the generated paraphrases are syntactically very similar to the source sentences. In this work, we present ParaAMR, a large-scale, syntactically diverse paraphrase dataset created by abstract meaning representation (AMR) back-translation. Our quantitative analysis, qualitative examples, and human evaluation demonstrate that the paraphrases in ParaAMR are syntactically more diverse than those in existing large-scale paraphrase datasets while preserving good semantic similarity. In addition, we show that ParaAMR can be used to improve three NLP tasks: learning sentence embeddings, syntactically controlled paraphrase generation, and data augmentation for few-shot learning. Our results thus showcase the potential of ParaAMR for improving various NLP applications.
