LLM-powered Data Augmentation for Enhanced Crosslingual Performance

05/23/2023
by Chenxi Whitehouse, et al.

This paper explores the potential of leveraging Large Language Models (LLMs) for data augmentation in crosslingual commonsense reasoning datasets, where the available training data is extremely limited. To achieve this, we employ several LLMs, including Dolly-v2, StableVicuna, ChatGPT, and GPT-4, to augment three datasets: XCOPA, XWinograd, and XStoryCloze. Subsequently, we assess the effectiveness of fine-tuning smaller crosslingual models, mBERT and XLMR, using the synthesised data. We compare training on data generated in English, data generated directly in the target languages, and English-generated data translated into the target languages. Our experiments reveal the overall advantages of incorporating data generated by LLMs. Training on synthetic data generated by GPT-4, whether English or multilingual, consistently improves performance over the baseline. Other models also exhibit an overall increase in performance; however, their effectiveness decreases in some settings. We also ask native speakers to evaluate the naturalness and logical soundness of the generated examples for different languages. Human evaluation reveals that LLMs like ChatGPT and GPT-4 excel at generating natural text in most languages, with a few exceptions such as Tamil. Moreover, ChatGPT falls short of the original dataset in generating plausible alternatives, while GPT-4 demonstrates competitive logical consistency in the synthesised data.
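The augmentation workflow described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the record fields follow the general shape of a COPA-style example (a premise, two alternatives, a cause/effect question type, and a label), and the prompt wording, the function names `build_prompt`, `parse_generated`, and `augment`, and the JSON-lines output format are all assumptions made for illustration. The call to the LLM itself is left out; the sketch only shows how seed examples might be turned into a prompt and how raw generations might be validated before being mixed into the training set.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class CopaExample:
    """Hypothetical COPA-style record; field names are illustrative."""
    premise: str
    choice1: str
    choice2: str
    question: str  # "cause" or "effect"
    label: int     # 0 if choice1 is more plausible, 1 if choice2 is


def build_prompt(seeds, n_new, language="English"):
    """Format seed examples into an instruction asking for new examples.

    The wording is an assumption, not the prompt used in the paper.
    """
    shots = "\n".join(json.dumps(asdict(s), ensure_ascii=False) for s in seeds)
    return (
        f"Here are examples of a commonsense reasoning task:\n{shots}\n"
        f"Write {n_new} new, diverse examples in {language}, "
        "one JSON object per line with the same keys."
    )


def parse_generated(raw_text):
    """Keep only well-formed generations; silently skip everything else."""
    examples = []
    for line in raw_text.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            ex = CopaExample(**json.loads(line))
        except (ValueError, TypeError):
            continue  # not JSON, not an object, or wrong set of keys
        texts = (ex.premise, ex.choice1, ex.choice2)
        if not all(isinstance(t, str) and t.strip() for t in texts):
            continue
        if ex.question not in ("cause", "effect"):
            continue
        if type(ex.label) is not int or ex.label not in (0, 1):
            continue
        examples.append(ex)
    return examples


def augment(original, generated):
    """Append generated examples that do not duplicate existing ones."""
    seen = set(original)
    combined = list(original)
    for ex in generated:
        if ex not in seen:
            seen.add(ex)
            combined.append(ex)
    return combined
```

In this sketch, the output of `build_prompt` would be sent to whichever LLM is being evaluated, the raw response passed through `parse_generated`, and the result of `augment` used to fine-tune the smaller crosslingual model. Filtering malformed or duplicate generations before training is a design choice of the sketch, not a step stated in the abstract.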


