Augmenting Passage Representations with Query Generation for Enhanced Cross-Lingual Dense Retrieval

by   Shengyao Zhuang, et al.

Effective cross-lingual dense retrieval methods that rely on multilingual pre-trained language models (PLMs) need to be trained to encompass both the relevance matching task and the cross-language alignment task. However, cross-lingual data for training is often scarcely available. In this paper, rather than using more cross-lingual data for training, we propose to use cross-lingual query generation to augment passage representations with queries in languages other than the original passage language. These augmented representations are used at inference time so that the representation can encode more information across the different target languages. Training of a cross-lingual query generator does not require additional training data to that used for the dense retriever. The query generator training is also effective because the pre-training task for the generator (T5 text-to-text training) is very similar to the fine-tuning task (generation of a query). The use of the generator does not increase query latency at inference and can be combined with any cross-lingual dense retrieval method. Results from experiments on a benchmark cross-lingual information retrieval dataset show that our approach can improve the effectiveness of existing cross-lingual dense retrieval methods. Implementation of our methods, along with all generated query files are made publicly available at


page 1

page 2

page 3

page 4


Unicoder: A Universal Language Encoder by Pre-training with Multiple Cross-lingual Tasks

We present Unicoder, a universal language encoder that is insensitive to...

Empowering Dual-Encoder with Query Generator for Cross-Lingual Dense Retrieval

In monolingual dense retrieval, lots of works focus on how to distill kn...

NeuralMind-UNICAMP at 2022 TREC NeuCLIR: Large Boring Rerankers for Cross-lingual Retrieval

This paper reports on a study of cross-lingual information retrieval (CL...

Simple Yet Effective Neural Ranking and Reranking Baselines for Cross-Lingual Information Retrieval

The advent of multilingual language models has generated a resurgence of...

A Dataset and Baselines for Multilingual Reply Suggestion

Reply suggestion models help users process emails and chats faster. Prev...

Bridging the Gap Between Indexing and Retrieval for Differentiable Search Index with Query Generation

The Differentiable Search Index (DSI) is a new, emerging paradigm for in...

Modeling the Music Genre Perception across Language-Bound Cultures

The music genre perception expressed through human annotations of artist...

Please sign up or login with your details

Forgot password? Click here to reset