CoSDA-ML: Multi-Lingual Code-Switching Data Augmentation for Zero-Shot Cross-Lingual NLP

06/11/2020
by Libo Qin, et al.

Multi-lingual contextualized embeddings, such as multilingual-BERT (mBERT), have shown success in a variety of zero-shot cross-lingual tasks. However, these models are limited by inconsistent contextualized representations of subwords across different languages. Existing work addresses this issue with bilingual projection and fine-tuning techniques. We propose a data augmentation framework that generates multi-lingual code-switching data to fine-tune mBERT, which encourages the model to align representations from the source language and multiple target languages at once by mixing their context information. Compared with existing work, our method does not rely on bilingual sentences for training and requires only one training process for multiple target languages. Experimental results on five tasks with 19 languages show that our method significantly improves performance on all tasks compared with mBERT.
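
The abstract does not spell out how the code-switched data is produced. As a rough illustration only, the sketch below assumes word-level substitution from per-language bilingual dictionaries, with each replaced token drawing its target language at random so that one sentence can mix several languages. The names `TOY_DICTS`, `code_switch`, and `replace_prob` are hypothetical and are not taken from the paper.

```python
import random

# Hypothetical toy dictionaries: source word -> candidate translations,
# keyed by target-language code. Real dictionaries would be far larger.
TOY_DICTS = {
    "de": {"good": ["gut"], "morning": ["Morgen"]},
    "es": {"good": ["bueno"], "morning": ["mañana"]},
}


def code_switch(tokens, dictionaries, replace_prob=0.5, rng=None):
    """Return a copy of `tokens` with some words swapped for translations.

    Each token is considered for replacement with probability `replace_prob`.
    A target language is drawn at random per token, so a single sentence can
    end up mixing several languages. Tokens with no dictionary entry in the
    drawn language are left unchanged.
    """
    rng = rng or random.Random()
    languages = sorted(dictionaries)
    switched = []
    for token in tokens:
        if rng.random() < replace_prob:
            lang = rng.choice(languages)
            candidates = dictionaries[lang].get(token.lower())
            if candidates:
                switched.append(rng.choice(candidates))
                continue
        switched.append(token)
    return switched


if __name__ == "__main__":
    sentence = ["good", "morning", "everyone"]
    print(code_switch(sentence, TOY_DICTS, replace_prob=0.7))
```

Because the substitution is one-for-one, the token count is unchanged, so sentence-level or token-level labels from the source-language training set could be reused as-is when fine-tuning on the augmented sentences. Redrawing the substitutions on every pass over the data would expose the model to different language mixtures of the same sentence.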
