SimAlign: High Quality Word Alignments without Parallel Training Data using Static and Contextualized Embeddings

04/18/2020
by Masoud Jalili Sabet, et al.

Word alignments are useful for tasks like statistical and neural machine translation (NMT) and annotation projection. Statistical word aligners perform well, as do methods that extract alignments jointly with translations in NMT. However, most approaches require parallel training data, and quality decreases as less training data is available. We propose word alignment methods that require no parallel data. The key idea is to leverage multilingual word embeddings, both static and contextualized, for word alignment. Our multilingual embeddings are created from monolingual data only, without relying on any parallel data or dictionaries. We find that alignments created from embeddings are competitive with, and mostly superior to, traditional statistical aligners, even in scenarios with abundant parallel data. For example, for a set of 100k parallel sentences, contextualized embeddings achieve a word alignment F1 for English-German that is more than 5% higher (absolute) than eflomal, a high-quality alignment model.
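To make the key idea concrete, the following is a minimal sketch of how word alignments could be read off multilingual embeddings. It assumes each word in the source and target sentence already has a vector in a shared multilingual space, scores every word pair by cosine similarity, and keeps a pair only when each word is the other's best match (a mutual argmax). The function names, the toy vectors, and the choice of mutual argmax are illustrative assumptions, not necessarily the exact procedure used in the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mutual_argmax_align(src_vecs, tgt_vecs):
    """Return (source index, target index) pairs that are each other's best match.

    Hypothetical helper for illustration: src_vecs and tgt_vecs are lists of
    word vectors assumed to live in one shared multilingual embedding space.
    """
    if not src_vecs or not tgt_vecs:
        return []
    # Similarity matrix: sim[i][j] compares source word i with target word j.
    sim = [[cosine(s, t) for t in tgt_vecs] for s in src_vecs]
    n_src, n_tgt = len(src_vecs), len(tgt_vecs)
    # Best target for every source word, and best source for every target word.
    best_tgt = [max(range(n_tgt), key=lambda j: sim[i][j]) for i in range(n_src)]
    best_src = [max(range(n_src), key=lambda i: sim[i][j]) for j in range(n_tgt)]
    # Keep only the pairs on which both directions agree.
    return [(i, j) for i, j in enumerate(best_tgt) if best_src[j] == i]


# Toy 3-dimensional vectors standing in for real multilingual embeddings.
source = [[1.0, 0.0, 0.1], [0.0, 1.0, 0.1]]   # e.g. "the", "house"
target = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0]]   # e.g. "das", "Haus"
print(mutual_argmax_align(source, target))    # [(0, 0), (1, 1)]
```

No parallel training data enters this sketch: the only inputs are the two sentences' embeddings. In practice the vectors would come from a static or contextualized multilingual embedding model rather than being written by hand.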

