MFAQ: a Multilingual FAQ Dataset

09/27/2021
by   Maxime De Bruyn, et al.
0

In this paper, we present the first multilingual FAQ dataset publicly available. We collected around 6M FAQ pairs from the web, in 21 different languages. Although this is significantly larger than existing FAQ retrieval datasets, it comes with its own challenges: duplication of content and uneven distribution of topics. We adopt a similar setup as Dense Passage Retrieval (DPR) and test various bi-encoders on this dataset. Our experiments reveal that a multilingual model based on XLM-RoBERTa achieves the best results, except for English. Lower resources languages seem to learn from one another as a multilingual model achieves a higher MRR than language-specific ones. Our qualitative analysis reveals the brittleness of the model on simple word changes. We publicly release our dataset, model and training script.

READ FULL TEXT
POST COMMENT

Comments

There are no comments yet.

Authors

page 1

page 2

page 3

page 4

07/30/2021

MTVR: Multilingual Moment Retrieval in Videos

We introduce mTVR, a large-scale multilingual video moment retrieval dat...
06/03/2021

A Dataset and Baselines for Multilingual Reply Suggestion

Reply suggestion models help users process emails and chats faster. Prev...
12/28/2020

Pivot Through English: Reliably Answering Multilingual Questions without Document Retrieval

Existing methods for open-retrieval question answering in lower resource...
10/13/2015

Bridge Correlational Neural Networks for Multilingual Multimodal Representation Learning

Recently there has been a lot of interest in learning common representat...
01/10/2022

Language-Agnostic Website Embedding and Classification

Currently, publicly available models for website classification do not o...
06/01/2019

"President Vows to Cut <Taxes> Hair": Dataset and Analysis of Creative Text Editing for Humorous Headlines

We introduce, release, and analyze a new dataset, called Humicroedit, fo...
09/15/2021

Learning to Match Job Candidates Using Multilingual Bi-Encoder BERT

In this talk, we will show how we used Randstad history of candidate pla...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.