Data Augmentation for Neural Machine Translation using Generative Language Model

07/26/2023
by Seokjin Oh, et al.

Despite rapid advances in model architectures, the scarcity of large parallel corpora remains the main bottleneck in Neural Machine Translation. Data augmentation is a technique that improves the performance of data-hungry models by generating synthetic data instead of collecting new data. We explore prompt-based data augmentation approaches that leverage large-scale language models such as ChatGPT. To create a synthetic parallel corpus, we compare three methods that use different prompts. We employ two assessment metrics to measure the diversity of the generated synthetic data. Unlike other augmentation methods such as back-translation, which require training an additional model, this approach incurs no further model training cost. The proposed method improves over the unaugmented baseline by 0.68 BLEU.
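The abstract does not name the two diversity metrics, so the following is only a minimal sketch of one common way to quantify the diversity of synthetic sentences: the distinct-n ratio (unique n-grams divided by total n-grams). The choice of distinct-n, the whitespace tokenization, the function names, and the toy sentences are all assumptions made for illustration, not the authors' setup.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """Return all contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def distinct_n(sentences: Iterable[str], n: int) -> float:
    """Ratio of unique n-grams to total n-grams over a corpus.

    Values near 1.0 mean little repetition; values near 0.0 mean the
    corpus keeps reusing the same n-grams. Returns 0.0 when the corpus
    contains no n-grams. Tokenization is plain whitespace splitting
    (an assumption for this sketch).
    """
    counts: Counter = Counter()
    for sentence in sentences:
        counts.update(ngrams(sentence.split(), n))
    total = sum(counts.values())
    return len(counts) / total if total else 0.0


# Hypothetical synthetic target-side sentences, for illustration only.
synthetic = [
    "the cat sat on the mat",
    "the cat sat on the rug",
    "a dog slept near the door",
]

d1 = distinct_n(synthetic, 1)  # unigram diversity
d2 = distinct_n(synthetic, 2)  # bigram diversity
print(f"distinct-1 = {d1:.3f}, distinct-2 = {d2:.3f}")
```

Under these assumptions, a score like this could be computed separately for the output of each prompt variant, which would let the prompts be compared on how repetitive their generated sentences are.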


