Improving Pretrained Cross-Lingual Language Models via Self-Labeled Word Alignment

06/11/2021
by Zewen Chi, et al.

Cross-lingual language models are typically pretrained with masked language modeling on multilingual text or parallel sentences. In this paper, we introduce denoising word alignment as a new cross-lingual pre-training task. Specifically, the model first self-labels word alignments for parallel sentences. Then we randomly mask tokens in a bitext pair, and, given a masked token, the model uses a pointer network to predict the aligned token in the other language. We alternately perform these two steps in an expectation-maximization manner. Experimental results show that our method improves cross-lingual transferability on various datasets, especially on token-level tasks such as question answering and structured prediction. Moreover, the model can serve as a pretrained word aligner, achieving reasonably low error rates on alignment benchmarks. The code and pretrained parameters are available at https://github.com/CZWin32768/XLM-Align.
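
To make the alternating procedure concrete, below is a minimal PyTorch sketch of the two steps. The names, tensor shapes, and simple dot-product pointer are illustrative assumptions, not the released implementation; the actual model and pre-training code are in the repository linked above.

    # Minimal sketch of the denoising word alignment loop (illustrative only).
    import torch
    import torch.nn.functional as F

    def self_label_alignments(src_hidden, tgt_hidden):
        # E-step: align each source token to the most similar target token,
        # using dot-product similarity between contextual representations.
        sim = src_hidden @ tgt_hidden.T            # (src_len, tgt_len)
        return sim.argmax(dim=-1)                  # (src_len,) target indices

    def denoising_alignment_loss(masked_src_hidden, tgt_hidden,
                                 masked_positions, labels):
        # M-step: for every masked source position, a pointer distribution
        # over target tokens is trained to select the self-labeled aligned token.
        queries = masked_src_hidden[masked_positions]   # (n_mask, d)
        logits = queries @ tgt_hidden.T                 # (n_mask, tgt_len)
        return F.cross_entropy(logits, labels[masked_positions])

    # Toy usage with random tensors standing in for encoder outputs.
    d, src_len, tgt_len = 16, 7, 9
    src_hidden = torch.randn(src_len, d)           # clean source-side encoding
    tgt_hidden = torch.randn(tgt_len, d)           # target-side encoding
    masked_src_hidden = torch.randn(src_len, d)    # encoding after masking source tokens
    masked_positions = torch.tensor([1, 4])        # positions that were masked

    labels = self_label_alignments(src_hidden, tgt_hidden)              # E-step
    loss = denoising_alignment_loss(masked_src_hidden, tgt_hidden,
                                    masked_positions, labels)           # M-step
    print(float(loss))

In the full method, these representations come from the cross-lingual encoder being pretrained, and the two steps alternate throughout training, so the self-labeled alignments improve as the encoder does.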
