Boosting Zero-shot Cross-lingual Retrieval by Training on Artificially Code-Switched Data

05/09/2023
by Robert Litschko, et al.

Transferring information retrieval (IR) models from a high-resource language (typically English) to other languages in a zero-shot fashion has become a widely adopted approach. In this work, we show that the effectiveness of zero-shot rankers diminishes when queries and documents are in different languages. Motivated by this, we propose to instead train ranking models on artificially code-switched data, which we generate using bilingual lexicons. To this end, we experiment with lexicons induced from (1) cross-lingual word embeddings and (2) parallel Wikipedia page titles. We use the mMARCO dataset to extensively evaluate reranking models on 36 language pairs spanning Monolingual IR (MoIR), Cross-lingual IR (CLIR), and Multilingual IR (MLIR). Our results show that code-switching can yield consistent and substantial gains of 5.1 MRR@10 in CLIR and 3.9 MRR@10 in MLIR, while maintaining stable performance in MoIR. Encouragingly, the gains are especially pronounced for distant languages (up to 2x absolute gain). We further show that our approach is robust to the ratio of code-switched tokens and also extends to unseen languages. Our results demonstrate that training on code-switched data is a cheap and effective way of generalizing zero-shot rankers to cross-lingual and multilingual retrieval.
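To make the idea concrete, below is a minimal sketch of lexicon-based code-switching as described in the abstract: each token of an English training query or passage is replaced, with some probability (the code-switch ratio), by a translation looked up in a bilingual lexicon. The function name `code_switch`, the independent per-token sampling policy, and the toy English-German lexicon are illustrative assumptions, not the authors' released implementation; in the paper, the lexicons are induced from cross-lingual word embeddings or parallel Wikipedia page titles.

```python
import random

def code_switch(tokens, lexicon, ratio=0.5, rng=None):
    """Replace each token with a lexicon translation with probability `ratio`.

    Illustrative sketch only (hypothetical helper, assumed policy):
    tokens absent from the lexicon are kept unchanged, and each
    replacement decision is sampled independently per token.
    """
    rng = rng or random.Random()
    out = []
    for tok in tokens:
        translations = lexicon.get(tok)
        if translations and rng.random() < ratio:
            out.append(rng.choice(translations))  # pick one candidate translation
        else:
            out.append(tok)
    return out

# Toy English->German lexicon; real lexicons would be induced from
# cross-lingual word embeddings or parallel Wikipedia page titles.
lexicon = {
    "capital": ["hauptstadt"],
    "city": ["stadt"],
    "germany": ["deutschland"],
}
query = "what is the capital city of germany".split()
print(" ".join(code_switch(query, lexicon, ratio=0.5, rng=random.Random(0))))
# e.g. "what is the capital city of deutschland"
```

Since the abstract reports that performance is robust to the ratio of code-switched tokens, the exact `ratio` value should not require careful tuning.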

