Towards Best Practices for Training Multilingual Dense Retrieval Models

04/05/2022
by Xinyu Zhang, et al.

Dense retrieval models using a transformer-based bi-encoder design have emerged as an active area of research. In this work, we focus on the task of monolingual retrieval in a variety of typologically diverse languages using one such design. Although recent work with multilingual transformers demonstrates that they exhibit strong cross-lingual generalization capabilities, there remain many open research questions, which we tackle here. Our study is organized as a "best practices" guide for training multilingual dense retrieval models, broken down into three main scenarios: where a multilingual transformer is available, but relevance judgments are not available in the language of interest; where both models and training data are available; and where training data are available but not models. In considering these scenarios, we gain a better understanding of the role of multi-stage fine-tuning, the strength of cross-lingual transfer under various conditions, the usefulness of out-of-language data, and the advantages of multilingual vs. monolingual transformers. Our recommendations offer a guide for practitioners building search applications, particularly for low-resource languages, and while our work leaves open a number of research questions, we provide a solid foundation for future work.
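
The bi-encoder design referenced above encodes queries and passages independently with a transformer and ranks passages by vector similarity. The snippet below is a minimal sketch of that idea, assuming a multilingual encoder such as bert-base-multilingual-cased from Hugging Face and [CLS]-token pooling; the model choice, pooling strategy, and example texts are illustrative assumptions, not the authors' exact setup.

```python
# Minimal bi-encoder dense retrieval sketch (assumptions: mBERT encoder, [CLS] pooling).
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "bert-base-multilingual-cased"  # any multilingual transformer could be used

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
encoder = AutoModel.from_pretrained(MODEL_NAME)
encoder.eval()

@torch.no_grad()
def encode(texts):
    """Encode a batch of texts into dense vectors using the [CLS] representation."""
    batch = tokenizer(texts, padding=True, truncation=True, max_length=256,
                      return_tensors="pt")
    outputs = encoder(**batch)
    return outputs.last_hidden_state[:, 0]  # (batch_size, hidden_dim)

# Queries and passages are encoded independently (the defining bi-encoder property).
query_vec = encode(["what is dense retrieval?"])
passage_vecs = encode([
    "Dense retrieval encodes queries and passages into vectors and ranks by similarity.",
    "An unrelated passage about cooking pasta.",
])

# Relevance is scored by dot-product similarity; higher score = more relevant.
scores = query_vec @ passage_vecs.T  # shape (1, num_passages)
print(scores)
```

In the scenarios the paper studies, such an encoder would then be fine-tuned on relevance judgments, which may or may not exist in the target language; the cross-lingual transfer questions concern how well training on one language's judgments carries over to retrieval in another.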

Related research

07/26/2021 - One Question Answering Model for Many Languages with Cross-lingual Dense Passage Retrieval
We present CORA, a Cross-lingual Open-Retrieval Answer Generation model ...

11/05/2019 - Unsupervised Cross-lingual Representation Learning at Scale
This paper shows that pretraining multilingual language models at scale ...

05/25/2023 - Cross-Lingual Knowledge Distillation for Answer Sentence Selection in Low-Resource Languages
While impressive performance has been achieved on the task of Answer Sen...

05/29/2019 - Regularization Advantages of Multilingual Neural Language Models for Low Resource Domains
Neural language modeling (LM) has led to significant improvements in sev...

05/30/2022 - ZusammenQA: Data Augmentation with Specialized Models for Cross-lingual Open-retrieval Question Answering System
This paper introduces our proposed system for the MIA Shared Task on Cro...

07/19/2022 - On the Usability of Transformers-based models for a French Question-Answering task
For many tasks, state-of-the-art results have been achieved with Transfo...

10/13/2022 - A Multi-dimensional Evaluation of Tokenizer-free Multilingual Pretrained Models
Recent work on tokenizer-free multilingual pretrained models show promis...
