Detecting Languages Unintelligible to Multilingual Models through Local Structure Probes

11/09/2022
by Louis Clouâtre, et al.

Providing better language tools for low-resource and endangered languages is imperative for equitable growth. Recent progress with massively multilingual pretrained models has proven surprisingly effective at performing zero-shot transfer to a wide variety of languages. However, this transfer is not universal: many languages are not currently understood by multilingual approaches. It is estimated that only 72 languages possess the "small set of labeled datasets" on which we could test a model's performance; the vast majority of languages lack the resources needed even to evaluate performance. In this work, we attempt to clarify which languages do and do not currently benefit from such transfer. To that end, we develop a general approach that requires only unlabelled text to detect which languages are not well understood by a cross-lingual model. Our approach derives from the hypothesis that if a model's understanding is insensitive to perturbations of text in a language, it likely has a limited understanding of that language. We construct a cross-lingual sentence similarity task to evaluate our approach empirically on 350, primarily low-resource, languages.
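
The abstract does not spell out the exact perturbation or similarity measure used as the local structure probe. The sketch below only illustrates the general idea under assumed choices: a local word-window shuffle as the perturbation, a generic multilingual encoder (xlm-roberta-base) for sentence embeddings, and cosine similarity between the original and perturbed sentence. A sensitivity score near zero would suggest the model barely distinguishes the scrambled text, consistent with limited understanding of that language.

```python
# Minimal sketch of a perturbation-sensitivity probe.
# Assumptions (not taken from the paper): local word-window shuffling as the
# perturbation, mean-pooled xlm-roberta-base embeddings, and cosine similarity.
import random

import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "xlm-roberta-base"  # any multilingual encoder; this choice is an assumption
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()


def embed(sentence: str) -> torch.Tensor:
    """Mean-pool the encoder's last hidden states into a crude sentence embedding."""
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, dim)
    mask = inputs["attention_mask"].unsqueeze(-1)   # (1, seq_len, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)


def perturb_locally(sentence: str, window: int = 3, seed: int = 0) -> str:
    """Shuffle words within small windows, disturbing local structure only."""
    rng = random.Random(seed)
    words = sentence.split()
    out = []
    for i in range(0, len(words), window):
        chunk = words[i:i + window]
        rng.shuffle(chunk)
        out.extend(chunk)
    return " ".join(out)


def sensitivity(sentence: str) -> float:
    """1 - cosine(original, perturbed); values near 0 mean the model is
    largely insensitive to the perturbation."""
    e_orig = embed(sentence)
    e_pert = embed(perturb_locally(sentence))
    cos = torch.nn.functional.cosine_similarity(e_orig, e_pert).item()
    return 1.0 - cos


if __name__ == "__main__":
    print(sensitivity("The quick brown fox jumps over the lazy dog."))
```

In practice one would average such a score over many sentences per language and compare it against languages the model is known to handle well; this is only a sketch of the probing idea, not the paper's exact procedure.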

Related research

04/18/2023
Transfer to a Low-Resource Language via Close Relatives: The Case Study on Faroese
Multilingual language models have pushed state-of-the-art in cross-lingu...

08/20/2020
Inducing Language-Agnostic Multilingual Representations
Multilingual representations have the potential to make cross-lingual sy...

06/05/2023
Cross-Lingual Transfer Learning for Phrase Break Prediction with Multilingual Language Model
Phrase break prediction is a crucial task for improving the prosody natu...

12/07/2022
JamPatoisNLI: A Jamaican Patois Natural Language Inference Dataset
JamPatoisNLI provides the first dataset for natural language inference i...

06/03/2021
How to Adapt Your Pretrained Multilingual Model to 1600 Languages
Pretrained multilingual models (PMMs) enable zero-shot learning via cros...

12/01/2020
Automatically Identifying Language Family from Acoustic Examples in Low Resource Scenarios
Existing multilingual speech NLP works focus on a relatively small subse...

03/18/2022
Do Multilingual Language Models Capture Differing Moral Norms?
Massively multilingual sentence representations are trained on large cor...
