Synthetic Cross-language Information Retrieval Training Data

04/29/2023
by James Mayfield, et al.

A key stumbling block for neural cross-language information retrieval (CLIR) systems has been the paucity of training data. The appearance of the MS MARCO monolingual training set led to significant advances in the state of the art in neural monolingual retrieval. Translating the MS MARCO documents into other languages with machine translation has made this resource useful to the CLIR community, yet such translation suffers from several problems. MS MARCO is large, but its size is fixed; its genre and domain of discourse are fixed; and the translated documents are written not in the language of a native speaker but in translationese. To address these problems, we introduce the JH-POLO CLIR training set creation methodology. The approach begins by selecting a pair of non-English passages. A generative large language model is then used to produce an English query for which the first passage is relevant and the second passage is not relevant. By repeating this process, collections of arbitrary size can be created in the style of MS MARCO but using naturally occurring documents in any desired genre and domain of discourse. This paper describes the methodology in detail, shows its use in creating new CLIR training sets, and describes experiments using the newly created training data.
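
The loop described in the abstract (pick two non-English passages, ask a generative model for an English query that the first answers and the second does not, repeat) can be illustrated with a minimal Python sketch. It is not the authors' implementation. Everything named here is an assumption made for illustration: the `Triple` record, the prompt wording in `build_prompt`, the uniform random choice of passage pairs, and the `generate` callable, which stands in for whatever large language model client is used.

```python
import random
from typing import Callable, List, NamedTuple, Sequence


class Triple(NamedTuple):
    """One MS MARCO-style training example (hypothetical record layout)."""
    query: str     # English query produced by the language model
    positive: str  # non-English passage the query should be relevant to
    negative: str  # non-English passage the query should not be relevant to


def build_prompt(positive: str, negative: str) -> str:
    """Assemble an instruction for the model. The wording is an assumption,
    not the prompt used in the paper."""
    return (
        "Passage A:\n" + positive + "\n\n"
        "Passage B:\n" + negative + "\n\n"
        "Write one English search query that Passage A answers "
        "and Passage B does not answer."
    )


def make_triples(
    passages: Sequence[str],
    generate: Callable[[str], str],
    n: int,
    seed: int = 0,
) -> List[Triple]:
    """Create up to n triples by repeatedly sampling a passage pair and
    asking `generate` (a stand-in for an LLM call) for a query."""
    rng = random.Random(seed)
    triples: List[Triple] = []
    for _ in range(n):
        # Assumption: pairs are drawn uniformly at random; the abstract
        # does not say how the pair is selected.
        positive, negative = rng.sample(list(passages), 2)
        query = generate(build_prompt(positive, negative)).strip()
        if query:  # skip empty generations
            triples.append(Triple(query, positive, negative))
    return triples
```

Passing any function that maps a prompt string to a model response as `generate` keeps the sketch independent of a particular model API. Because the passages come from a collection of the user's choosing, the resulting set can follow that collection's genre and domain and can be made as large as needed by raising `n`.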


research
03/10/2023

Naver Labs Europe (SPLADE) @ TREC NeuCLIR 2022

This paper describes our participation in the 2022 TREC NeuCLIR challeng...
research
09/03/2022

Multilingual ColBERT-X

ColBERT-X is a dense retrieval model for Cross Language Information Retr...
research
01/20/2022

Transfer Learning Approaches for Building Cross-Language Dense Retrieval Models

The advent of transformer-based models such as BERT has led to the rise ...
research
05/22/2016

Automatic Construction of Discourse Corpora for Dialogue Translation

In this paper, a novel approach is proposed to automatically construct p...
research
05/25/2016

Dimension Projection among Languages based on Pseudo-relevant Documents for Query Translation

Using top-ranked documents in response to a query has been shown to be a...
research
09/01/2018

Simple Fusion: Return of the Language Model

Neural Machine Translation (NMT) typically leverages monolingual data in...
