EntityCS: Improving Zero-Shot Cross-lingual Transfer with Entity-Centric Code Switching

10/22/2022
by   Chenxi Whitehouse, et al.

Accurate alignment between languages is fundamental for improving cross-lingual pre-trained language models (XLMs). Motivated by the natural phenomenon of code-switching (CS) in multilingual speakers, CS has been used as an effective data augmentation method, offering language alignment at the word or phrase level rather than at the sentence level via parallel instances. Existing approaches either use dictionaries or parallel sentences with word alignment to generate CS data by randomly switching words in a sentence. However, such methods can be suboptimal: dictionaries disregard semantics, and syntax may become invalid after random word switching. In this work, we propose EntityCS, a method that focuses on Entity-level Code-Switching to capture fine-grained cross-lingual semantics without corrupting syntax. We use Wikidata and English Wikipedia to construct an entity-centric CS corpus by switching entities to their counterparts in other languages. We further propose entity-oriented masking strategies during intermediate model training on the EntityCS corpus to improve entity prediction. Evaluation of the trained models on four entity-centric downstream tasks shows consistent improvements over the baseline, with a notable increase of 10% in Fact Retrieval. We release the corpus and models to assist research on code-switching and enriching XLMs with external knowledge.
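To make the entity-level switching step concrete, the sketch below replaces linked entity mentions in an English Wikipedia sentence with their Wikidata labels in a target language, leaving all other tokens (and hence the English syntax) untouched. This is a minimal illustration under stated assumptions, not the released implementation: the entity_code_switch function, the WIKIDATA_LABELS table, and the switch_prob parameter are hypothetical stand-ins for the actual Wikidata-based lookup and corpus-construction pipeline.

```python
import random

# Toy stand-in for a Wikidata label table: QID -> {language code: label}.
# In the paper's setting these labels come from Wikidata; the entries
# below are illustrative only.
WIKIDATA_LABELS = {
    "Q64": {"de": "Berlin", "fr": "Berlin", "el": "Βερολίνο"},
    "Q183": {"de": "Deutschland", "fr": "Allemagne", "el": "Γερμανία"},
}

def entity_code_switch(tokens, entity_spans, target_langs, switch_prob=1.0):
    """Switch entity mentions to labels in a randomly chosen target
    language, keeping every non-entity token unchanged.

    tokens       : word tokens of an English Wikipedia sentence
    entity_spans : list of (start, end, qid) spans over `tokens`
    target_langs : candidate languages to switch entities into
    """
    lang = random.choice(target_langs)
    out, cursor = [], 0
    for start, end, qid in sorted(entity_spans):
        out.extend(tokens[cursor:start])          # copy non-entity context
        label = WIKIDATA_LABELS.get(qid, {}).get(lang)
        if label and random.random() < switch_prob:
            out.append(label)                     # switched entity surface form
        else:
            out.extend(tokens[start:end])         # keep the English mention
        cursor = end
    out.extend(tokens[cursor:])
    return " ".join(out)

# Example: "Berlin is the capital of Germany" with two linked entities.
tokens = ["Berlin", "is", "the", "capital", "of", "Germany"]
spans = [(0, 1, "Q64"), (5, 6, "Q183")]
print(entity_code_switch(tokens, spans, target_langs=["de", "fr", "el"]))
# possible output: "Βερολίνο is the capital of Γερμανία"
```

During intermediate training on such a corpus, the entity-oriented masking strategies mentioned in the abstract would then target these entity spans (for instance, masking whole entities rather than arbitrary subwords) so that the model is pushed to predict entities across languages.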


Related research

- Self-Augmentation Improves Zero-Shot Cross-Lingual Transfer (09/19/2023): Zero-shot cross-lingual transfer is a central task in multilingual NLP, ...
- El Volumen Louder Por Favor: Code-switching in Task-oriented Semantic Parsing (01/26/2021): Being able to parse code-switched (CS) utterances, such as Spanish+Engli...
- Multi-Level Contrastive Learning for Cross-Lingual Alignment (02/26/2022): Cross-language pre-trained models such as multilingual BERT (mBERT) have...
- CoSDA-ML: Multi-Lingual Code-Switching Data Augmentation for Zero-Shot Cross-Lingual NLP (06/11/2020): Multi-lingual contextualized embeddings, such as multilingual-BERT (mBER...
- Cross-lingual Alignment Methods for Multilingual BERT: A Comparative Study (09/29/2020): Multilingual BERT (mBERT) has shown reasonable capability for zero-shot ...
- Cross-Lingual Fine-Grained Entity Typing (10/15/2021): The growth of cross-lingual pre-trained models has enabled NLP tools to ...
- Domain-oriented Language Pre-training with Adaptive Hybrid Masking and Optimal Transport Alignment (12/01/2021): Motivated by the success of pre-trained language models such as BERT in ...
