mMARCO: A Multilingual Version of MS MARCO Passage Ranking Dataset

08/31/2021
by Luiz Henrique Bonifacio, et al.

The MS MARCO ranking dataset has been widely used to train deep learning models for IR tasks, and these models achieve considerable effectiveness in diverse zero-shot scenarios. However, this type of resource is scarce in languages other than English. In this work, we present mMARCO, a multilingual version of the MS MARCO passage ranking dataset comprising 8 languages, created using machine translation. We evaluated mMARCO by fine-tuning monolingual and multilingual re-ranking models on it. Experimental results demonstrate that multilingual models fine-tuned on our translated dataset achieve higher effectiveness than models fine-tuned on the original English version alone. In addition, our distilled multilingual re-ranker is competitive with non-distilled models while having 5.4 times fewer parameters. The translated datasets and fine-tuned models are available at https://github.com/unicamp-dl/mMARCO.git.


