Mr. TyDi: A Multi-lingual Benchmark for Dense Retrieval

08/19/2021
by Xinyu Zhang, et al.

We present Mr. TyDi, a multi-lingual benchmark dataset for mono-lingual retrieval in eleven typologically diverse languages, designed to evaluate ranking with learned dense representations. The goal of this resource is to spur research in dense retrieval techniques in non-English languages, motivated by recent observations that existing techniques for representation learning perform poorly when applied to out-of-distribution data. As a starting point, we provide zero-shot baselines for this new dataset based on a multi-lingual adaptation of DPR that we call "mDPR". Experiments show that although the effectiveness of mDPR is much lower than that of BM25, dense representations nevertheless appear to provide valuable relevance signals, improving BM25 results in sparse-dense hybrids. In addition to analyses of our results, we also discuss future challenges and present a research agenda in multi-lingual dense retrieval. Mr. TyDi can be downloaded at https://github.com/castorini/mr.tydi.
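
The abstract does not say how the sparse-dense hybrids combine BM25 and mDPR scores. As a rough illustration only, the sketch below shows one common generic approach: min-max normalize each score list per query, then interpolate linearly. The function names, the `alpha` weight, and the choice of normalization are all assumptions made for this example, not the authors' method.

```python
from typing import Dict, List, Tuple


def min_max_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale one query's scores to [0, 1] so two systems are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores equal: no ranking signal, so map everything to 0.
        return {doc_id: 0.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def hybrid_rank(
    sparse: Dict[str, float],
    dense: Dict[str, float],
    alpha: float = 0.5,  # hypothetical interpolation weight on the sparse side
) -> List[Tuple[str, float]]:
    """Fuse sparse (e.g. BM25) and dense scores by linear interpolation.

    A document missing from one list contributes 0 for that system
    (an assumption; other fallbacks are possible).
    """
    sparse_n = min_max_normalize(sparse)
    dense_n = min_max_normalize(dense)
    doc_ids = set(sparse_n) | set(dense_n)
    fused = {
        d: alpha * sparse_n.get(d, 0.0) + (1.0 - alpha) * dense_n.get(d, 0.0)
        for d in doc_ids
    }
    # Sort by fused score (descending), breaking ties by document id.
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy scores for a single query; the document ids are made up.
bm25_scores = {"d1": 10.0, "d2": 5.0, "d3": 0.0}
dense_scores = {"d2": 0.9, "d3": 0.8, "d4": 0.1}
ranking = hybrid_rank(bm25_scores, dense_scores, alpha=0.5)
```

In this toy case a document scored moderately by both systems can outrank one scored highly by only one of them, which is the kind of complementary signal the abstract attributes to dense representations.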


