Embedding-Enhanced GIZA++: Improving Alignment in Low- and High-Resource Scenarios Using Embedding Space Geometry

04/18/2021
by   Kelly Marchisio, et al.

A popular natural language processing task decades ago, word alignment was until recently dominated by GIZA++, a statistical method based on the 30-year-old IBM models. Though recent years have finally seen GIZA++'s performance bested, the new methods primarily rely on large machine translation models, massively multilingual language models, or supervision from GIZA++ alignments themselves. We introduce Embedding-Enhanced GIZA++ and outperform GIZA++ without any of the aforementioned factors. Taking advantage of the monolingual embedding space geometry of only the source and target languages, we exceed GIZA++'s performance in every tested scenario for three languages. In the lowest-resource scenario of only 500 lines of bitext, we improve performance over GIZA++ by 10.9 AER. Our method scales monotonically, outperforming GIZA++ for all tested scenarios between 500 and 1.9 million lines of bitext. Our code will be made publicly available.
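The abstract reports gains in AER (Alignment Error Rate), the standard word-alignment metric from Och and Ney (2003). As a minimal sketch of how such a number is computed — the function name and example link sets below are illustrative, not taken from the paper — AER compares a hypothesized alignment against gold "sure" and "possible" links:

```python
def aer(sure, possible, hypothesis):
    """Alignment Error Rate (Och & Ney, 2003).

    Each argument is a set of (source_index, target_index) link pairs.
    `possible` is assumed to be a superset of `sure`.
    AER = 1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|); lower is better.
    """
    a, s, p = hypothesis, sure, possible
    return 1.0 - (len(a & s) + len(a & p)) / (len(a) + len(s))


# Illustrative example: a perfect hypothesis yields AER = 0.0.
sure = {(0, 0), (1, 1)}
possible = sure | {(2, 2)}          # one extra "possible" link
print(aer(sure, possible, {(0, 0), (1, 1), (2, 2)}))  # 0.0
```

A hypothesis that misses sure links or adds spurious ones raises the score; a 10.9-point AER improvement means roughly 11% fewer weighted alignment errors on this scale.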


research
01/21/2018

A Universal Semantic Space

Multilingual embeddings build on the success of monolingual embeddings a...
research
05/23/2023

FOCUS: Effective Embedding Initialization for Specializing Pretrained Multilingual Models on a Single Language

Using model weights pretrained on a high-resource language as a warm sta...
research
08/25/2019

Multilingual Neural Machine Translation with Language Clustering

Multilingual neural machine translation (NMT), which translates multiple...
research
12/19/2014

Embedding Word Similarity with Neural Machine Translation

Neural language models learn word representations, or embeddings, that c...
research
05/06/2020

Unsupervised Multimodal Neural Machine Translation with Pseudo Visual Pivoting

Unsupervised machine translation (MT) has recently achieved impressive r...
research
05/09/2022

Sub-Word Alignment Is Still Useful: A Vest-Pocket Method for Enhancing Low-Resource Machine Translation

We leverage embedding duplication between aligned sub-words to extend th...
research
05/30/2023

Stable Anisotropic Regularization

Given the success of Large Language Models (LLMs), there has been consid...
