MFAQ: a Multilingual FAQ Dataset

09/27/2021
by Maxime De Bruyn, et al.

In this paper, we present the first publicly available multilingual FAQ dataset. We collected around 6M FAQ pairs from the web, in 21 different languages. Although this is significantly larger than existing FAQ retrieval datasets, it comes with its own challenges: duplication of content and uneven distribution of topics. We adopt a setup similar to Dense Passage Retrieval (DPR) and test various bi-encoders on this dataset. Our experiments reveal that a multilingual model based on XLM-RoBERTa achieves the best results, except for English. Lower-resource languages seem to learn from one another, as a multilingual model achieves a higher mean reciprocal rank (MRR) than language-specific ones. Our qualitative analysis reveals the brittleness of the model to simple word changes. We publicly release our dataset, model, and training script.
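To make the evaluation metric concrete, the following is a minimal sketch of how MRR could be computed for a bi-encoder retriever that scores question and answer embeddings by inner product, as in DPR-style setups. The function names and the toy vectors are assumptions made for illustration; they are not taken from the released training script, and real embeddings would come from an encoder such as XLM-RoBERTa.

```python
from typing import Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product, used here as the question-answer similarity score."""
    return sum(a * b for a, b in zip(u, v))


def mean_reciprocal_rank(
    question_vecs: Sequence[Sequence[float]],
    answer_vecs: Sequence[Sequence[float]],
    gold: Sequence[int],
) -> float:
    """Average of 1 / rank of the gold answer over all questions."""
    total = 0.0
    for q, gold_idx in zip(question_vecs, gold):
        scores = [dot(q, a) for a in answer_vecs]
        # Rank = 1 + number of candidates scoring strictly higher than gold.
        rank = 1 + sum(s > scores[gold_idx] for s in scores)
        total += 1.0 / rank
    return total / len(gold)


# Toy embeddings, invented for illustration only.
questions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
answers = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.4]]
gold = [0, 1, 2]  # index of the correct answer for each question

mrr = mean_reciprocal_rank(questions, answers, gold)
print(f"MRR = {mrr:.4f}")
```

On this toy data the gold answer ranks first for the first two questions and third for the last one, so the MRR is (1 + 1 + 1/3) / 3 = 7/9.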

research
07/30/2021

mTVR: Multilingual Moment Retrieval in Videos

We introduce mTVR, a large-scale multilingual video moment retrieval dat...
research
09/27/2022

mRobust04: A Multilingual Version of the TREC Robust 2004 Benchmark

Robust 2004 is an information retrieval benchmark whose large number of ...
research
06/03/2021

A Dataset and Baselines for Multilingual Reply Suggestion

Reply suggestion models help users process emails and chats faster. Prev...
research
08/23/2022

MATra: A Multilingual Attentive Transliteration System for Indian Scripts

Transliteration is a task in the domain of NLP where the output word is ...
research
01/10/2022

Language-Agnostic Website Embedding and Classification

Currently, publicly available models for website classification do not o...
research
05/13/2023

Multilingual Previously Fact-Checked Claim Retrieval

Fact-checkers are often hampered by the sheer amount of online content t...
research
06/01/2019

"President Vows to Cut <Taxes> Hair": Dataset and Analysis of Creative Text Editing for Humorous Headlines

We introduce, release, and analyze a new dataset, called Humicroedit, fo...