Lexicon-Enhanced Self-Supervised Training for Multilingual Dense Retrieval

03/27/2023
by Houxing Ren, et al.

Recent multilingual pre-trained models have improved performance on a variety of multilingual tasks. However, these models perform poorly on multilingual retrieval tasks due to a lack of multilingual training data. In this paper, we propose to mine and generate self-supervised training data from a large-scale unlabeled corpus. We carefully design a mining method that combines sparse and dense models to mine the relevance of unlabeled queries and passages. We also introduce a query generator that produces additional queries in the target languages for unlabeled passages. Through extensive experiments on the Mr. TyDi dataset and an industrial dataset from a commercial search engine, we demonstrate that our method outperforms baselines built on various pre-trained multilingual models. Our method even achieves performance on par with the supervised method on the latter dataset.
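As a rough illustration of what combining a sparse and a dense signal for mining could look like, the sketch below keeps a query-passage pair only when a lexical-overlap score and an embedding cosine similarity both clear a threshold. This is not the paper's algorithm, which the abstract does not specify: the function names (`lexical_score`, `cosine`, `mine_pairs`), the token-overlap scoring, the agreement rule, and the threshold values are all assumptions made for illustration, and the toy vectors stand in for embeddings from a real multilingual encoder.

```python
import math


def lexical_score(query, passage):
    """Toy sparse signal (assumption): fraction of query tokens found in the passage."""
    q_tokens = query.lower().split()
    p_tokens = set(passage.lower().split())
    if not q_tokens:
        return 0.0
    return sum(1 for t in q_tokens if t in p_tokens) / len(q_tokens)


def cosine(u, v):
    """Cosine similarity between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mine_pairs(queries, passages, q_vecs, p_vecs, sparse_min=0.5, dense_min=0.5):
    """Return (query_idx, passage_idx, sparse, dense) for pairs both signals accept.

    The agreement rule and both thresholds are hypothetical choices.
    """
    mined = []
    for qi, query in enumerate(queries):
        for pi, passage in enumerate(passages):
            sparse = lexical_score(query, passage)
            dense = cosine(q_vecs[qi], p_vecs[pi])
            if sparse >= sparse_min and dense >= dense_min:
                mined.append((qi, pi, sparse, dense))
    return mined


if __name__ == "__main__":
    queries = ["capital of france"]
    passages = ["paris is the capital of france", "bananas are yellow"]
    # Hand-written toy embeddings; a real setup would encode text with a model.
    q_vecs = [[1.0, 0.0]]
    p_vecs = [[0.9, 0.1], [0.0, 1.0]]
    for qi, pi, sparse, dense in mine_pairs(queries, passages, q_vecs, p_vecs):
        print(qi, pi, round(sparse, 3), round(dense, 3))
```

Requiring both signals to agree trades recall for precision; a weighted sum of the two scores would be another plausible way to combine them.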


research
12/11/2022

SEPT: Towards Scalable and Efficient Visual Pre-Training

Recently, the self-supervised pre-training paradigm has shown great pote...
research
11/09/2022

Accidental Learners: Spoken Language Identification in Multilingual Self-Supervised Models

In this paper, we extend previous self-supervised approaches for languag...
research
07/01/2023

Self-Supervised Query Reformulation for Code Search

Automatic query reformulation is a widely utilized technology for enrich...
research
05/19/2023

Language-Universal Phonetic Representation in Multilingual Speech Pretraining for Low-Resource Speech Recognition

We improve low-resource ASR by integrating the ideas of multilingual tra...
research
05/04/2022

Same Neurons, Different Languages: Probing Morphosyntax in Multilingual Pre-trained Models

The success of multilingual pre-trained models is underpinned by their a...
research
05/15/2023

Soft Prompt Decoding for Multilingual Dense Retrieval

In this work, we explore a Multilingual Information Retrieval (MLIR) tas...
research
07/29/2019

Self-Supervised Learning for Stereo Reconstruction on Aerial Images

Recent developments established deep learning as an inevitable tool to b...
