Improving Cross-lingual Information Retrieval on Low-Resource Languages via Optimal Transport Distillation

01/29/2023
by Zhiqi Huang et al.

Benefiting from transformer-based pre-trained language models, neural ranking models have made significant progress. More recently, the advent of multilingual pre-trained language models has provided strong support for designing neural cross-lingual retrieval models. However, due to unbalanced pre-training data across languages, multilingual language models already show a performance gap between high- and low-resource languages on many downstream tasks, and cross-lingual retrieval models built on such pre-trained models can inherit this language bias, leading to suboptimal results for low-resource languages. Moreover, unlike the English-to-English retrieval task, for which large-scale training collections for document ranking such as MS MARCO are available, the lack of cross-lingual retrieval data for low-resource languages makes training cross-lingual retrieval models more challenging. In this work, we propose OPTICAL: Optimal Transport distillation for low-resource Cross-lingual information retrieval. To transfer a model from high- to low-resource languages, OPTICAL formulates cross-lingual token alignment as an optimal transport problem and learns from a well-trained monolingual retrieval model. By separating cross-lingual knowledge from the knowledge of query-document matching, OPTICAL needs only bitext data for distillation training, which is more feasible for low-resource languages. Experimental results show that, with minimal training data, OPTICAL significantly outperforms strong baselines on low-resource languages, including baselines based on neural machine translation.
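The abstract frames cross-lingual token alignment as an optimal transport problem. As a rough illustration of that idea (not the authors' implementation), the sketch below computes an entropy-regularized transport plan between two small sets of token embeddings with the Sinkhorn-Knopp algorithm; the embedding sizes, uniform marginals, and regularization value are all illustrative assumptions.

```python
import numpy as np

def sinkhorn(cost, reg=0.1, n_iters=200):
    """Entropy-regularized optimal transport via Sinkhorn-Knopp.

    cost: (n, m) pairwise cost matrix between source and target tokens.
    Returns a transport plan whose marginals are (approximately) uniform.
    """
    n, m = cost.shape
    a = np.ones(n) / n              # uniform source marginal (assumption)
    b = np.ones(m) / m              # uniform target marginal (assumption)
    K = np.exp(-cost / reg)         # Gibbs kernel
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):        # alternating scaling updates
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]

# Toy example: cosine distances between random "token embeddings".
rng = np.random.default_rng(0)
src = rng.normal(size=(4, 8))       # 4 source-language token vectors
tgt = rng.normal(size=(5, 8))       # 5 target-language token vectors
src /= np.linalg.norm(src, axis=1, keepdims=True)
tgt /= np.linalg.norm(tgt, axis=1, keepdims=True)
cost = 1.0 - src @ tgt.T            # cosine distance as transport cost
plan = sinkhorn(cost)
# plan[i, j] is the soft alignment weight between source token i
# and target token j; rows sum to ~1/4, columns to ~1/5.
```

In a distillation setting like the one the abstract describes, such a soft-alignment plan could be used to map token-level relevance signals from a teacher's language to a student's, though the exact loss OPTICAL uses is specified in the full paper.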

Related research

- 07/01/2023: Low-Resource Cross-Lingual Adaptive Training for Nigerian Pidgin. Developing effective spoken language processing systems for low-resource...
- 06/08/2019: Improving Low-Resource Cross-lingual Document Retrieval by Reranking with Deep Bilingual Representations. In this paper, we propose to boost low-resource cross-lingual document r...
- 02/26/2023: Cross-lingual Knowledge Transfer via Distillation for Multilingual Information Retrieval. In this paper, we introduce the approach behind our submission for the M...
- 10/31/2018: Cross-lingual Transfer Learning for Multilingual Task Oriented Dialog. One of the first steps in the utterance interpretation pipeline of many ...
- 05/17/2020: Cross-Lingual Low-Resource Set-to-Description Retrieval for Global E-Commerce. With the prosperity of cross-border e-commerce, there is an urgent deman...
- 12/23/2021: Do Multi-Lingual Pre-trained Language Models Reveal Consistent Token Attributions in Different Languages? During the past several years, a surge of multi-lingual Pre-trained Lang...
- 10/25/2022: Bilingual Lexicon Induction for Low-Resource Languages using Graph Matching via Optimal Transport. Bilingual lexicons form a critical component of various natural language...
