Bitext Mining Using Distilled Sentence Representations for Low-Resource Languages

05/25/2022
by Kevin Heffernan et al.

Scaling multilingual representation learning beyond the hundred most frequent languages is challenging, in particular to cover the long tail of low-resource languages. A promising approach has been to train one-for-all multilingual models capable of cross-lingual transfer, but these models often suffer from insufficient capacity and interference between unrelated languages. Instead, we move away from this approach and train multiple language-(family-)specific representations, while crucially still enabling all languages to be encoded in the same representational space. To achieve this, we use teacher-student training, which keeps all encoders mutually compatible for bitext mining and enables fast learning of new languages. We introduce a new teacher-student training scheme that combines supervised and self-supervised objectives, allowing encoders to take advantage of monolingual training data, which is valuable in the low-resource setting. Our approach significantly outperforms the original LASER encoder. We study very low-resource languages and handle 50 African languages, many of which are not covered by any other model. For these languages, we train sentence encoders, mine bitexts, and validate the bitexts by training NMT systems.
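To make the training scheme concrete, the sketch below shows the two pieces the abstract describes: a distillation loss that pulls a student encoder's embedding of a low-resource sentence towards a frozen teacher's embedding of its aligned translation (so every student stays compatible with the teacher's representational space), and margin-based scoring over the resulting embeddings for bitext mining. This is a minimal PyTorch illustration under assumptions of our own: ToyEncoder, the mean pooling, the tensor shapes, and the MSE objective are placeholders rather than the paper's architecture, and the ratio-margin scoring follows Artetxe and Schwenk (2019), which this line of work builds on.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyEncoder(nn.Module):
    """Stand-in sentence encoder (placeholder for a transformer encoder)."""
    def __init__(self, vocab_size=1000, dim=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)

    def forward(self, token_ids):
        # Mean-pool token embeddings into one fixed-size sentence vector.
        return self.emb(token_ids).mean(dim=1)

teacher = ToyEncoder()  # frozen multilingual teacher defining the shared space
student = ToyEncoder()  # language-(family-)specific student being trained
for p in teacher.parameters():
    p.requires_grad_(False)

def distillation_loss(teacher_ids, student_ids):
    """Supervised objective: the student embeds the low-resource side of a
    bitext so that it lands on the teacher's embedding of the aligned
    high-resource sentence. A self-supervised loss on monolingual data
    (e.g. masked language modelling) would be added as a second term."""
    with torch.no_grad():
        t = teacher(teacher_ids)      # teacher sees e.g. the English side
    s = student(student_ids)          # student sees the low-resource side
    return F.mse_loss(s, t)

def mine_margin(xs, ys, k=4):
    """Ratio-margin scoring for bitext mining: cosine similarity of a
    candidate pair, normalised by the mean similarity of each embedding
    to its k nearest neighbours in the other language."""
    xs, ys = F.normalize(xs, dim=-1), F.normalize(ys, dim=-1)
    sims = xs @ ys.T                            # all-pairs cosine similarity
    knn_x = sims.topk(k, dim=1).values.mean(1)  # each x vs. its nearest ys
    knn_y = sims.topk(k, dim=0).values.mean(0)  # each y vs. its nearest xs
    margin = sims / ((knn_x[:, None] + knn_y[None, :]) / 2)
    return margin.argmax(dim=1), margin.max(dim=1).values

# Toy usage: one optimisation step, then mine over the fresh embeddings.
opt = torch.optim.Adam(student.parameters(), lr=1e-4)
src = torch.randint(0, 1000, (8, 16))  # 8 parallel sentences, 16 token ids
tgt = torch.randint(0, 1000, (8, 16))
loss = distillation_loss(src, tgt)
loss.backward()
opt.step()
best_idx, scores = mine_margin(student(tgt), teacher(src))
```

At real scale, mining runs over millions of sentences per language pair, so the all-pairs similarity matrix above is typically replaced by approximate nearest-neighbour search (e.g. FAISS); the dense matrix multiply here is only to show the scoring.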


Related research

10/13/2022 · You Can Have Your Data and Balance It Too: Towards Balanced and Efficient Multilingual Models
Multilingual models have been widely used for cross-lingual transfer to ...

05/25/2023 · Cross-Lingual Knowledge Distillation for Answer Sentence Selection in Low-Resource Languages
While impressive performance has been achieved on the task of Answer Sen...

06/22/2023 · xSIM++: An Improved Proxy to Bitext Mining Performance for Low-Resource Languages
We introduce a new proxy score for evaluating bitext mining based on sim...

10/10/2022 · Multilingual Representation Distillation with Contrastive Learning
Multilingual sentence representations from large models can encode seman...

09/04/2023 · NLLB-CLIP – train performant multilingual image retrieval model on a budget
Today, the exponential rise of large models developed by academic and in...

04/15/2021 · Demystify Optimization Challenges in Multilingual Transformers
Multilingual Transformer improves parameter efficiency and crosslingual ...

04/30/2021 · Scaling End-to-End Models for Large-Scale Multilingual ASR
Building ASR models across many language families is a challenging multi...
