ParaCotta: Synthetic Multilingual Paraphrase Corpora from the Most Diverse Translation Sample Pair

05/10/2022
by   Alham Fikri Aji, et al.

We release a synthetic parallel paraphrase corpus covering 17 languages: Arabic, Catalan, Czech, German, English, Spanish, Estonian, French, Hindi, Indonesian, Italian, Dutch, Romanian, Russian, Swedish, Vietnamese, and Chinese. Our method relies only on monolingual data and a neural machine translation system to generate paraphrases, which makes it simple to apply. We generate multiple translation samples using beam search and choose the most lexically diverse pair, that is, the pair with the lowest sentence BLEU. We compare our generated corpus with the . According to our evaluation, our synthetic paraphrase pairs are semantically similar and lexically diverse.
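
The selection step described above (pick the most lexically diverse pair among several translation samples, judged by sentence BLEU) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the add-one smoothed BLEU on whitespace tokens, the symmetric averaging of the two BLEU directions, and the names `sentence_bleu` and `most_diverse_pair` are all assumptions made for the example. The candidate list stands in for beam-search outputs from a translation system, which is not shown.

```python
import math
from collections import Counter
from itertools import combinations


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(hypothesis, reference, max_n=4):
    """Add-one smoothed sentence BLEU on whitespace tokens.

    Assumption: a simplified stand-in for whatever sentence-BLEU
    variant the paper uses; it is only meant to rank overlap.
    """
    hyp, ref = hypothesis.split(), reference.split()
    if not hyp or not ref:
        return 0.0
    log_precision = 0.0
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        # Clipped n-gram matches: Counter intersection takes the minimum count.
        overlap = sum((hyp_counts & ref_counts).values())
        total = sum(hyp_counts.values())
        log_precision += math.log((overlap + 1) / (total + 1))
    brevity_penalty = min(1.0, math.exp(1 - len(ref) / len(hyp)))
    return brevity_penalty * math.exp(log_precision / max_n)


def most_diverse_pair(candidates):
    """Return the pair of candidates with the lowest symmetric sentence BLEU."""
    if len(candidates) < 2:
        raise ValueError("need at least two candidates to form a pair")

    def score(pair):
        a, b = pair
        # Assumption: average both directions so the score is symmetric.
        return (sentence_bleu(a, b) + sentence_bleu(b, a)) / 2

    return min(combinations(candidates, 2), key=score)


# Hypothetical translation samples for one source sentence.
samples = [
    "the cat sat on the mat",
    "the cat sat on a mat",
    "a feline was sitting on the rug",
]
pair = most_diverse_pair(samples)
```

A lower BLEU between two samples means less n-gram overlap, so minimizing it favors pairs that word the same content differently. Checking that the pair remains semantically similar is a separate step that this sketch does not cover.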


