Modeling Sequential Sentence Relation to Improve Cross-lingual Dense Retrieval

02/03/2023
by   Shunyu Zhang, et al.

Recently, multilingual pre-trained language models (PLMs) such as mBERT and XLM-R have made impressive strides in cross-lingual dense retrieval. Despite these successes, they are general-purpose PLMs, and multilingual PLMs tailored for cross-lingual retrieval remain unexplored. Motivated by the observation that sentences in parallel documents appear in approximately the same order, a property that holds universally across languages, we propose to model this sequential sentence relation to facilitate cross-lingual representation learning. Specifically, we propose a multilingual PLM called the masked sentence model (MSM), which consists of a sentence encoder that generates sentence representations and a document encoder applied to the sequence of sentence vectors from a document. The document encoder is shared across all languages to model the universal sequential sentence relation. To train the model, we propose a masked sentence prediction task, which masks a sentence and predicts its vector via a hierarchical contrastive loss with sampled negatives. Comprehensive experiments on four cross-lingual retrieval tasks show that MSM significantly outperforms existing advanced pre-training models, demonstrating the effectiveness and stronger cross-lingual retrieval capability of our approach. Code and model will be available.
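The masked sentence prediction objective described above can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the `predict` function is a hypothetical stand-in for the shared document encoder, and only the basic InfoNCE-style contrastive term is shown, omitting the hierarchical structure of the full loss.

```python
import math

def cos(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def masked_sentence_contrastive_loss(doc_vectors, mask_idx, negatives, predict, tau=0.1):
    """InfoNCE-style loss for predicting a masked sentence vector.

    doc_vectors: sentence vectors of one document (from the sentence encoder)
    mask_idx:    index of the masked sentence; its true vector is the positive
    negatives:   sampled sentence vectors serving as contrastive negatives
    predict:     stand-in for the document encoder, mapping the masked
                 context to a predicted vector for position mask_idx
    tau:         softmax temperature
    """
    positive = doc_vectors[mask_idx]
    # Replace the masked position with a zero vector before prediction.
    dim = len(positive)
    context = [v if i != mask_idx else [0.0] * dim
               for i, v in enumerate(doc_vectors)]
    q = predict(context, mask_idx)
    # Score the positive first, then the sampled negatives.
    scores = [cos(q, positive) / tau] + [cos(q, n) / tau for n in negatives]
    # Numerically stable -log softmax of the positive score.
    m = max(scores)
    log_denom = m + math.log(sum(math.exp(s - m) for s in scores))
    return -(scores[0] - log_denom)
```

A predictor that recovers a vector close to the masked sentence yields a near-zero loss, while one that matches a negative instead is penalized heavily; training the document encoder to minimize this loss is what ties sentence order to the representation space.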

