Investigating Lexical Replacements for Arabic-English Code-Switched Data Augmentation

05/25/2022
by Injy Hamed, et al.

Code-switching (CS) poses several challenges for NLP tasks, and data sparsity is a main obstacle to the development of CS NLP systems. In this paper, we investigate data augmentation techniques for synthesizing Dialectal Arabic-English CS text. We perform lexical replacements using parallel corpora and alignments, where CS points are either chosen randomly or learnt using a sequence-to-sequence model. We evaluate the effectiveness of data augmentation on language modeling (LM), machine translation (MT), and automatic speech recognition (ASR) tasks. Results show that, with 1-1 alignments, trained predictive models produce more natural CS sentences, as reflected in perplexity. By relying on grow-diag-final alignments, we then identify aligned segments and perform replacements accordingly. Replacing segments instead of words greatly improves the quality of the synthesized data. With this improvement, the random-based approach outperforms trained predictive models on all extrinsic tasks. Our best models achieve an improvement of 33.6 in perplexity, +3.2-5.6 BLEU points on the MT task, and an improvement of 7 in WER on the ASR task. We also help fill the gap in resources by collecting and publishing the first Arabic English CS-English parallel corpus.
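To make the random-based variant with 1-1 alignments concrete, here is a minimal sketch of the general idea. It is not the authors' implementation: the function name `synthesize_cs`, the `switch_prob` parameter, and the toy transliterated sentence pair are all assumptions made for illustration, and the alignment is assumed to be given as (source index, target index) pairs produced by some external word aligner.

```python
import random


def synthesize_cs(src_tokens, tgt_tokens, alignment, switch_prob=0.3, rng=None):
    """Swap randomly chosen source tokens for their aligned target tokens.

    alignment: iterable of (src_idx, tgt_idx) pairs, assumed to be 1-1.
    switch_prob: hypothetical knob for how often an aligned token is switched.
    """
    rng = rng or random.Random()
    src_to_tgt = dict(alignment)
    out = []
    for i, tok in enumerate(src_tokens):
        # Only aligned tokens are candidates for a code-switch point.
        if i in src_to_tgt and rng.random() < switch_prob:
            out.append(tgt_tokens[src_to_tgt[i]])
        else:
            out.append(tok)
    return out


# Toy, made-up sentence pair (transliterated), for illustration only.
src = ["ana", "shuft", "el", "film"]
tgt = ["I", "saw", "the", "movie"]
align = [(0, 0), (1, 1), (2, 2), (3, 3)]

print(synthesize_cs(src, tgt, align, switch_prob=0.5, rng=random.Random(0)))
```

A segment-based variant would first group alignment points into contiguous aligned spans and replace whole spans instead of single tokens; the learnt variant would replace the random draw with the prediction of a trained model.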

research
08/29/2021

Investigations on Speech Recognition Systems for Low-Resource Dialectal Arabic-English Code-Switching Speech

Code-switching (CS), defined as the mixing of languages in conversations...
research
09/24/2019

Code-switching Language Modeling With Bilingual Word Embeddings: A Case Study for Egyptian Arabic-English

Code-switching (CS) is a widespread phenomenon among bilingual and multi...
research
10/11/2022

Exploring Segmentation Approaches for Neural Machine Translation of Code-Switched Egyptian Arabic-English Text

Data sparsity is one of the main challenges posed by Code-switching (CS)...
research
12/11/2022

End-to-End Speech Translation of Arabic to English Broadcast News

Speech translation (ST) is the task of directly translating acoustic spe...
research
07/12/2020

The ASRU 2019 Mandarin-English Code-Switching Speech Recognition Challenge: Open Datasets, Tracks, Methods and Results

Code-switching (CS) is a common phenomenon and recognizing CS speech is ...
research
11/16/2022

Cognitive Simplification Operations Improve Text Simplification

Text Simplification (TS) is the task of converting a text into a form th...
research
07/12/2021

Accenture at CheckThat! 2021: Interesting claim identification and ranking with contextually sensitive lexical training data augmentation

This paper discusses the approach used by the Accenture Team for CLEF202...
