Dual-Alignment Pre-training for Cross-lingual Sentence Embedding

05/16/2023
by Ziheng Li, et al.
Recent studies have shown that dual encoder models trained on the sentence-level translation ranking task are effective for cross-lingual sentence embedding. However, our research indicates that token-level alignment, which has not been fully explored previously, is also crucial in multilingual scenarios. Based on these findings, we propose a dual-alignment pre-training (DAP) framework for cross-lingual sentence embedding that incorporates both sentence-level and token-level alignment. To achieve this, we introduce a novel representation translation learning (RTL) task, in which the model learns to use the contextualized token representations of one side to reconstruct their translation counterpart. This reconstruction objective encourages the model to embed translation information into the token representations. Compared to other token-level alignment methods such as translation language modeling, RTL is more suitable for dual encoder architectures and is computationally efficient. Extensive experiments on three sentence-level cross-lingual benchmarks demonstrate that our approach can significantly improve sentence embedding. Our code is available at https://github.com/ChillingDream/DAP.
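
To make the two alignment levels concrete, here is a minimal NumPy sketch, not the authors' implementation. The sentence-level part is a standard in-batch translation ranking loss. The token-level part is only an illustration of "reconstructing one side's token representations from the other side": the attention-based reconstruction, the `queries` input (for example, target-side position embeddings), the projection `w`, and the mean-squared-error objective are all assumptions, since the abstract does not specify them.

```python
import numpy as np


def translation_ranking_loss(src, tgt):
    """In-batch ranking loss: sentence i in src should rank its translation
    (row i of tgt) above every other sentence in the batch."""
    src = src / np.linalg.norm(src, axis=1, keepdims=True)
    tgt = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    logits = src @ tgt.T                                   # (batch, batch) cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))


def reconstruction_loss(src_tokens, tgt_tokens, queries, w):
    """Hypothetical token-level objective (an assumption, not the paper's RTL head).

    src_tokens: (m, d) contextualized token representations of one side
    tgt_tokens: (n, d) contextualized token representations of the translation
    queries:    (n, d) per-position queries, e.g. target position embeddings
    w:          (d, d) output projection
    """
    d = src_tokens.shape[1]
    scores = queries @ src_tokens.T / np.sqrt(d)           # (n, m) attention scores
    scores = scores - scores.max(axis=1, keepdims=True)
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=1, keepdims=True)          # softmax over source tokens
    predicted = attn @ src_tokens @ w                      # (n, d) reconstructed target side
    return np.mean((predicted - tgt_tokens) ** 2)


def dual_alignment_loss(sent_src, sent_tgt, src_tokens, tgt_tokens, queries, w, weight=1.0):
    """Sum of the sentence-level and token-level terms; `weight` is illustrative."""
    return translation_ranking_loss(sent_src, sent_tgt) + weight * reconstruction_loss(
        src_tokens, tgt_tokens, queries, w
    )
```

In this sketch the two terms are simply summed, so the sentence embeddings and the token representations are trained jointly; how the actual framework weights or schedules the objectives is not stated in the abstract.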

research
04/17/2023

VECO 2.0: Cross-lingual Language Model Pre-training with Multi-granularity Contrastive Learning

Recent studies have demonstrated the potential of cross-lingual transfer...
research
12/21/2020

Narrative Incoherence Detection

Motivated by the increasing popularity of intelligent editing assistant,...
research
06/11/2021

Improving Pretrained Cross-Lingual Language Models via Self-Labeled Word Alignment

The cross-lingual language models are typically pretrained with masked l...
research
05/22/2023

Towards Unsupervised Recognition of Semantic Differences in Related Documents

Automatically highlighting words that cause semantic differences between...
research
05/28/2021

Lightweight Cross-Lingual Sentence Representation Learning

Large-scale models for learning fixed-dimensional cross-lingual sentence...
research
10/18/2022

Synergy with Translation Artifacts for Training and Inference in Multilingual Tasks

Translation has played a crucial role in improving the performance on mu...
research
06/13/2023

Knowledge-Prompted Estimator: A Novel Approach to Explainable Machine Translation Assessment

Cross-lingual Machine Translation (MT) quality estimation plays a crucia...
