WSPAlign: Word Alignment Pre-training via Large-Scale Weakly Supervised Span Prediction

06/09/2023
by   Qiyu Wu, et al.

Most existing word alignment methods rely on manual alignment datasets or parallel corpora, which limits their usefulness. Here, to mitigate the dependence on manual data, we broaden the source of supervision by relaxing the requirement for correct, fully-aligned, and parallel sentences. Specifically, we construct a large-scale dataset of noisy, partially aligned, and non-parallel paragraphs, and use this weakly-supervised dataset for word alignment pre-training via span prediction. Extensive experiments across various settings empirically demonstrate that our approach, named WSPAlign, is an effective and scalable way to pre-train word aligners without manual data. When fine-tuned on standard benchmarks, WSPAlign sets a new state of the art, improving over the best supervised baseline by 3.3 to 6.1 points in F1 and 1.5 to 6.1 points in AER. Furthermore, WSPAlign also achieves competitive performance compared with the corresponding baselines in few-shot, zero-shot, and cross-lingual tests, which suggests that WSPAlign is potentially more practical for low-resource languages than existing methods.
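To make the span-prediction formulation concrete: the aligner is given a source sentence with one word marked, plus the target sentence, and extracts the target-side span that translates the marked word, much like extractive question answering. Below is a minimal sketch of inference under that formulation, assuming a generic extractive-QA checkpoint from HuggingFace Transformers; the model name and the ¶ word markers are illustrative placeholders, not WSPAlign's released artifacts.

```python
# Sketch: word alignment as extractive span prediction (QA-style).
# Assumptions: a generic multilingual QA-capable encoder stands in for a
# pre-trained WSPAlign model; ¶ markers around the query word follow one
# common convention in span-prediction alignment work.
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

model_name = "bert-base-multilingual-cased"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)

source = "I like ¶ apples ¶ ."   # "apples" is the word to align
target = "私はりんごが好きです。"   # target sentence to extract the span from

# Encode the pair as (question, context), as in SQuAD-style QA.
inputs = tokenizer(source, target, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Independent argmax over start/end logits; a real decoder would also
# constrain end >= start and restrict the span to the target segment.
start = outputs.start_logits.argmax(dim=-1).item()
end = outputs.end_logits.argmax(dim=-1).item()
span_tokens = inputs["input_ids"][0][start : end + 1]
print(tokenizer.decode(span_tokens))  # predicted aligned span
```

In this framing, pre-training only needs (marked word, paragraph pair, answer span) triples, which is why noisy, partially aligned, non-parallel paragraphs suffice as weak supervision in place of manually aligned parallel sentences.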
