Unsupervised Context Aware Sentence Representation Pretraining for Multi-lingual Dense Retrieval

06/07/2022
by Ning Wu, et al.

Recent research demonstrates the effectiveness of using pretrained language models (PLMs) to improve dense retrieval and multilingual dense retrieval. In this work, we present a simple but effective monolingual pretraining task called contrastive context prediction (CCP), which learns sentence representations by modeling sentence-level contextual relations. By pulling the embeddings of sentences within a local context closer together and pushing random negative samples away, different languages form isomorphic structures, so that sentence pairs in two different languages are automatically aligned. Our experiments show that model collapse and information leakage occur easily during contrastive training of language models; a language-specific memory bank and an asymmetric batch normalization operation play essential roles in preventing collapse and information leakage, respectively. In addition, post-processing of the sentence embeddings is also effective for achieving better retrieval performance. On the multilingual sentence retrieval task Tatoeba, our model achieves new SOTA results among methods that do not use bilingual data, and it shows larger gains when transferring between non-English pairs. On two multilingual query-passage retrieval tasks, XOR Retrieve and Mr.TYDI, our model achieves SOTA results in both zero-shot and supervised settings, even compared with pretraining models that use bilingual data.
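To make the described objective more concrete, below is a minimal sketch of how a contrastive context prediction loss with a language-specific memory bank and asymmetric batch normalization could look. This is an illustrative PyTorch sketch based solely on the abstract; all names, dimensions, and hyperparameters (e.g. `CCPHead`, `bank_size`, `temperature`) are assumptions, not the authors' implementation.

```python
# Illustrative sketch of a contrastive context prediction (CCP) style loss.
# Everything here (class name, dimensions, FIFO memory bank, where batch norm
# is applied) is an assumption inferred from the abstract, not the paper's code.

import torch
import torch.nn as nn
import torch.nn.functional as F


class CCPHead(nn.Module):
    """Scores a sentence embedding against an in-context positive and random
    negatives drawn from a language-specific memory bank (assumed design)."""

    def __init__(self, hidden_dim=768, proj_dim=256, bank_size=4096, temperature=0.05):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, proj_dim)
        # Asymmetric batch normalization: applied to only one side of each pair,
        # one plausible reading of the operation the abstract credits with
        # preventing information leakage.
        self.bn = nn.BatchNorm1d(proj_dim, affine=False)
        self.temperature = temperature
        # One memory bank per language; a single bank stands in for one language here.
        self.register_buffer("bank", F.normalize(torch.randn(bank_size, proj_dim), dim=-1))
        self.register_buffer("bank_ptr", torch.zeros(1, dtype=torch.long))

    def forward(self, anchor_emb, context_emb):
        """anchor_emb, context_emb: [batch, hidden_dim] embeddings of a sentence
        and a neighboring sentence from the same local context (the positive pair)."""
        q = F.normalize(self.bn(self.proj(anchor_emb)), dim=-1)  # BN on the query side only
        k = F.normalize(self.proj(context_emb), dim=-1)          # no BN on the key side

        pos = (q * k).sum(dim=-1, keepdim=True)                  # [batch, 1]
        neg = q @ self.bank.t()                                   # [batch, bank_size]
        logits = torch.cat([pos, neg], dim=1) / self.temperature
        labels = torch.zeros(q.size(0), dtype=torch.long, device=logits.device)
        loss = F.cross_entropy(logits, labels)                    # positive is index 0

        self._update_bank(k.detach())
        return loss

    @torch.no_grad()
    def _update_bank(self, keys):
        """FIFO update of the language-specific memory bank with the latest keys."""
        ptr = int(self.bank_ptr)
        n = keys.size(0)
        idx = torch.arange(ptr, ptr + n) % self.bank.size(0)
        self.bank[idx] = keys
        self.bank_ptr[0] = (ptr + n) % self.bank.size(0)


if __name__ == "__main__":
    head = CCPHead()
    anchors = torch.randn(8, 768)   # stand-ins for PLM sentence embeddings
    contexts = torch.randn(8, 768)  # embeddings of neighboring sentences
    print(head(anchors, contexts))
```

Applying batch normalization to only the query branch is just one way to realize an asymmetric operation between the two sides of a pair; the exact placement and form used in the paper may differ.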


Related research

12/20/2022 · Parameter-efficient Zero-shot Transfer for Cross-Language Dense Retrieval with Adapters
A popular approach to creating a zero-shot cross-language retrieval mode...

10/23/2020 · Multilingual BERT Post-Pretraining Alignment
We propose a simple method to align multilingual contextual embeddings a...

10/14/2021 · Representation Decoupling for Open-Domain Passage Retrieval
Training dense passage representations via contrastive learning (CL) has...

12/21/2021 · On Cross-Lingual Retrieval with Multilingual Text Encoders
In this work we present a systematic empirical study focused on the suit...

08/19/2021 · Mr. TyDi: A Multi-lingual Benchmark for Dense Retrieval
We present Mr. TyDi, a multi-lingual benchmark dataset for mono-lingual ...

03/30/2022 · Auto-MLM: Improved Contrastive Learning for Self-supervised Multi-lingual Knowledge Retrieval
Contrastive learning (CL) has become a ubiquitous approach for several n...

05/31/2023 · Structure-Aware Language Model Pretraining Improves Dense Retrieval on Structured Data
This paper presents Structure Aware Dense Retrieval (SANTA) model, which...
