BET: A Backtranslation Approach for Easy Data Augmentation in Transformer-based Paraphrase Identification Context

09/25/2020
by   Jean-Philippe Corbeil, et al.

Newly introduced deep learning architectures, namely BERT, XLNet, RoBERTa and ALBERT, have proved to be robust on several NLP tasks. However, the datasets these architectures are trained on are fixed in size and generalizability. To alleviate this issue, we apply one of the most inexpensive solutions to update these datasets. We call this approach BET, by which we analyze backtranslation data augmentation on transformer-based architectures. Using the Google Translate API with ten intermediary languages from ten different language families, we externally evaluate the results in the context of automatic paraphrase identification in a transformer-based framework. Our findings suggest that BET improves paraphrase identification performance on the Microsoft Research Paraphrase Corpus (MRPC) by more than 3 points in both accuracy and F1 score. We also analyze the augmentation in the low-data regime with downsampled versions of MRPC, the Twitter Paraphrase Corpus (TPC) and Quora Question Pairs. In many low-data cases, we observe a switch from a failing model on the test set to reasonable performance. The results demonstrate that BET is a highly promising data augmentation technique, both to push the current state of the art on existing datasets and to bootstrap the use of deep learning architectures in the low-data regime of around a hundred samples.
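The core idea of backtranslation augmentation can be sketched as follows. This is a minimal illustration, not the authors' code: the `translate` function here is a placeholder stub (the paper used the Google Translate API, which requires credentials and is not reproduced here), and the five-language pivot list is an assumption standing in for the ten intermediary languages used in the paper.

```python
def translate(text, src, tgt):
    """Placeholder MT backend; swap in a real service (e.g. Google Translate).
    This stub just tags the text so the pipeline is runnable end to end."""
    return f"[{src}->{tgt}] {text}"

# Hypothetical pivot set for illustration; the paper used ten languages
# drawn from ten different language families.
INTERMEDIARY_LANGUAGES = ["fr", "de", "ru", "ar", "zh"]

def backtranslate(sentence, pivot, src="en"):
    """Round-trip a sentence through a pivot language to obtain a paraphrase."""
    pivoted = translate(sentence, src, pivot)
    return translate(pivoted, pivot, src)

def augment_pairs(pairs):
    """Expand (sentence1, sentence2, label) pairs by backtranslating both
    sentences through each intermediary language, keeping the original label."""
    augmented = list(pairs)
    for s1, s2, label in pairs:
        for lang in INTERMEDIARY_LANGUAGES:
            augmented.append(
                (backtranslate(s1, lang), backtranslate(s2, lang), label)
            )
    return augmented
```

With one original pair and five pivot languages, `augment_pairs` yields six labeled pairs, so the training set grows linearly with the number of intermediary languages.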

