Overlap-based Vocabulary Generation Improves Cross-lingual Transfer Among Related Languages

03/03/2022
by Vaidehi Patil, et al.

Pre-trained multilingual language models such as mBERT and XLM-R have demonstrated great potential for zero-shot cross-lingual transfer to low web-resource languages (LRLs). However, because model capacity is limited, the large gap between the monolingual corpora available for high web-resource languages (HRLs) and for LRLs leaves little scope for co-embedding an LRL with an HRL, which hurts downstream task performance on LRLs. In this paper, we argue that relatedness among the languages of a language family, specifically their lexical overlap, can be leveraged to overcome some of the corpus limitations of LRLs. We propose Overlap BPE (OBPE), a simple yet effective modification to the BPE vocabulary generation algorithm that enhances overlap across related languages. Through extensive experiments on multiple NLP tasks and datasets, we observe that OBPE generates a vocabulary that increases the representation of LRLs via tokens shared with HRLs. This improves zero-shot transfer from related HRLs to LRLs without reducing HRL representation or accuracy. Unlike previous studies that dismissed the importance of token overlap, we show that token overlap matters in the low-resource, related-language setting: synthetically reducing the overlap to zero can cause as much as a four-fold drop in zero-shot transfer accuracy.
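
To make the idea of an overlap-aware BPE concrete, here is a minimal toy sketch in Python. It is not the paper's OBPE objective, which the abstract does not spell out. The scoring rule (a mix of total pair frequency and the minimum of the HRL and LRL pair frequencies), the `alpha` weight, and all function names are assumptions introduced purely for illustration.

```python
from collections import Counter


def pair_counts(corpus):
    """Count adjacent symbol pairs in a corpus of {symbol_tuple: word_frequency}."""
    counts = Counter()
    for symbols, freq in corpus.items():
        for a, b in zip(symbols, symbols[1:]):
            counts[(a, b)] += freq
    return counts


def apply_merge(corpus, pair):
    """Replace every occurrence of `pair` with the concatenated symbol."""
    merged = Counter()
    for symbols, freq in corpus.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] += freq
    return dict(merged)


def overlap_aware_merges(hrl_words, lrl_words, num_merges, alpha=0.5):
    """Greedy BPE merges with a hypothetical overlap bonus.

    alpha=0 reduces to plain BPE on the pooled corpora; alpha=1 scores a
    pair only by how often it occurs in BOTH languages (assumed rule).
    """
    hrl = {tuple(w): f for w, f in hrl_words.items()}
    lrl = {tuple(w): f for w, f in lrl_words.items()}
    merges = []
    for _ in range(num_merges):
        h, l = pair_counts(hrl), pair_counts(lrl)
        candidates = sorted(set(h) | set(l))  # sorted for deterministic ties
        if not candidates:
            break

        def score(p):
            # Assumed scoring: pooled frequency plus a shared-usage term.
            return (1 - alpha) * (h[p] + l[p]) + alpha * min(h[p], l[p])

        best = max(candidates, key=score)
        merges.append(best)
        hrl, lrl = apply_merge(hrl, best), apply_merge(lrl, best)
    return merges


# Toy word-frequency tables (made up): the HRL corpus is much larger.
hrl_words = {"abab": 10, "cd": 3}
lrl_words = {"cd": 2}

plain_first = overlap_aware_merges(hrl_words, lrl_words, 1, alpha=0.0)
overlap_first = overlap_aware_merges(hrl_words, lrl_words, 1, alpha=1.0)
```

On this toy data, plain frequency-based merging picks the HRL-only pair `("a", "b")` first, while the overlap-weighted variant picks the shared pair `("c", "d")`. This illustrates, under the assumed scoring rule, how rewarding shared tokens can raise LRL representation in the vocabulary.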

research
09/19/2023

Self-Augmentation Improves Zero-Shot Cross-Lingual Transfer

Zero-shot cross-lingual transfer is a central task in multilingual NLP, ...
research
12/19/2022

Cross-Lingual Retrieval Augmented Prompt for Low-Resource Languages

Multilingual Pretrained Language Models (MPLMs) have shown their strong ...
research
05/09/2023

Utilizing Lexical Similarity to Enable Zero-Shot Machine Translation for Extremely Low-resource Languages

We address the task of machine translation from an extremely low-resourc...
research
01/29/2022

Does Transliteration Help Multilingual Language Modeling?

As there is a scarcity of large representative corpora for most language...
research
11/09/2018

Zero-shot Neural Transfer for Cross-lingual Entity Linking

Cross-lingual entity linking maps an entity mention in a source language...
research
11/09/2020

An Analysis of Dataset Overlap on Winograd-Style Tasks

The Winograd Schema Challenge (WSC) and variants inspired by it have bec...
research
09/09/2023

Embedding structure matters: Comparing methods to adapt multilingual vocabularies to new languages

Pre-trained multilingual language models underpin a large portion of mod...
