Auto-MLM: Improved Contrastive Learning for Self-supervised Multi-lingual Knowledge Retrieval

03/30/2022
by Wenshen Xu, et al.

Contrastive learning (CL) has become a ubiquitous approach for several natural language processing (NLP) downstream tasks, especially for question answering (QA). However, a major challenge remains unresolved: how to efficiently train the knowledge retrieval model in an unsupervised manner. Recently, commonly used methods combine CL with a masked language model (MLM). However, MLM ignores sentence-level training, and CL likewise neglects the extraction of internal information from the query. To address the problem that CL can hardly obtain internal information from the original query, we introduce a joint training method that combines CL and Auto-MLM for self-supervised multi-lingual knowledge retrieval. First, we acquire a fixed-dimensional sentence vector. Then, we mask some words in the original sentences with a random strategy. Finally, we generate a new token representation for predicting the masked tokens. Experimental results show that our proposed approach consistently outperforms all previous SOTA methods on both the AliExpress & LAZADA service corpus and openly available corpora in 8 languages.
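To make the joint objective concrete, below is a minimal PyTorch sketch of one training step combining a sentence-level contrastive loss with a token-level masked-prediction loss. It assumes a HuggingFace-style multilingual encoder; the names `encoder` and `mlm_head`, the 15% masking ratio, and the InfoNCE-style loss with in-batch negatives are illustrative assumptions, not the paper's exact formulation of Auto-MLM.

```python
# Sketch of joint CL + masked-token training, assuming a HuggingFace-style
# encoder (e.g., XLM-R). `encoder`, `mlm_head`, and all hyperparameters are
# hypothetical choices for illustration; attention/padding masks are omitted
# for brevity.
import torch
import torch.nn.functional as F

def mask_tokens(input_ids, mask_token_id, mask_prob=0.15):
    """Randomly mask tokens; return corrupted inputs and MLM labels."""
    labels = input_ids.clone()
    probs = torch.full(input_ids.shape, mask_prob, device=input_ids.device)
    masked = torch.bernoulli(probs).bool()
    labels[~masked] = -100                 # loss only on masked positions
    corrupted = input_ids.clone()
    corrupted[masked] = mask_token_id
    return corrupted, labels

def joint_loss(encoder, mlm_head, query_ids, pos_ids, mask_token_id,
               vocab_size, temperature=0.05):
    # 1) Sentence-level CL: fixed-dimensional [CLS] vectors, in-batch
    #    negatives, InfoNCE-style cross-entropy over similarity logits.
    q = encoder(query_ids).last_hidden_state[:, 0]   # (B, H)
    p = encoder(pos_ids).last_hidden_state[:, 0]     # (B, H)
    q, p = F.normalize(q, dim=-1), F.normalize(p, dim=-1)
    logits = q @ p.t() / temperature                 # (B, B)
    targets = torch.arange(q.size(0), device=q.device)
    cl_loss = F.cross_entropy(logits, targets)

    # 2) Token-level loss: encode the randomly corrupted query and predict
    #    the masked tokens from the new token representations.
    corrupted, labels = mask_tokens(query_ids, mask_token_id)
    hidden = encoder(corrupted).last_hidden_state    # (B, T, H)
    mlm_logits = mlm_head(hidden)                    # (B, T, V)
    mlm_loss = F.cross_entropy(mlm_logits.view(-1, vocab_size),
                               labels.view(-1), ignore_index=-100)

    return cl_loss + mlm_loss                        # joint objective
```

In practice the two losses might be combined with a weighting coefficient tuned on a development set; the equal-weight sum above is simply the most common baseline choice.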


