Evaluating Multilingual Text Encoders for Unsupervised Cross-Lingual Retrieval

01/21/2021
by Robert Litschko, et al.

Pretrained multilingual text encoders based on neural Transformer architectures, such as multilingual BERT (mBERT) and XLM, have achieved strong performance on a myriad of language understanding tasks. Consequently, they have been adopted as a go-to paradigm for multilingual and cross-lingual representation learning and transfer, rendering cross-lingual word embeddings (CLWEs) effectively obsolete. However, questions remain as to what extent this finding generalizes 1) to unsupervised settings and 2) to ad-hoc cross-lingual IR (CLIR) tasks. We therefore present a systematic empirical study of the suitability of state-of-the-art multilingual encoders for cross-lingual document and sentence retrieval across a large number of language pairs. In contrast to supervised language understanding, our results indicate that for unsupervised document-level CLIR (a setup with no relevance judgments available for IR-specific fine-tuning) pretrained encoders fail to significantly outperform models based on CLWEs. For sentence-level CLIR, we demonstrate that state-of-the-art performance can be achieved; however, peak performance is not reached with the general-purpose multilingual text encoders used off the shelf, but rather with their variants that have been further specialized for sentence understanding tasks.
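To make the unsupervised sentence-level CLIR setup concrete, the sketch below embeds a query and candidate sentences from different languages into a shared multilingual space and ranks candidates by cosine similarity, with no relevance judgments or IR-specific fine-tuning. It is a minimal illustration only: the sentence-transformers library and the model name "paraphrase-multilingual-MiniLM-L12-v2" are assumptions chosen for demonstration, not necessarily the encoders evaluated in the paper.

```python
# Minimal sketch of unsupervised sentence-level CLIR.
# Assumption: the sentence-transformers library and an illustrative
# multilingual model; the paper's own encoders may differ.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

# English query, German candidate sentences (toy collection).
query = "How well do multilingual encoders work for cross-lingual retrieval?"
docs = [
    "Mehrsprachige Encoder werden für das sprachübergreifende Retrieval untersucht.",
    "Das Wetter in Berlin ist heute sonnig.",
]

# Encode query and documents into the shared multilingual space,
# then rank documents by cosine similarity -- no fine-tuning involved.
query_emb = model.encode(query, convert_to_tensor=True)
doc_embs = model.encode(docs, convert_to_tensor=True)
scores = util.cos_sim(query_emb, doc_embs)[0]

for rank, idx in enumerate(scores.argsort(descending=True), start=1):
    print(f"{rank}. score={scores[idx]:.3f}  {docs[idx]}")
```

The same ranking scheme extends to document-level CLIR by encoding full documents (or aggregating their sentence embeddings), which is the setting where the paper finds pretrained encoders do not clearly beat CLWE-based models.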


