Utilizing Language-Image Pretraining for Efficient and Robust Bilingual Word Alignment

05/23/2022
by Tuan Dinh, et al.

Word translation without parallel corpora has become feasible, rivaling the performance of supervised methods. Recent work has shown that the accuracy and robustness of unsupervised word translation (UWT) can be improved by exploiting visual observations, which are universal representations across languages. In this work, we investigate the potential of using not only visual observations but also pretrained language-image models to enable more efficient and robust UWT. Specifically, we develop a novel UWT method, dubbed Word Alignment using Language-Image Pretraining (WALIP), which leverages visual observations via the shared embedding space of images and texts provided by CLIP models (Radford et al., 2021). WALIP follows a two-step procedure. First, we retrieve word pairs with high similarity confidence, computed using our proposed image-based fingerprints, which define the initial pivot for the word alignment. Second, we apply our robust Procrustes algorithm to estimate the linear mapping between the two embedding spaces, iteratively correcting and refining the estimated alignment. Our extensive experiments show that WALIP improves upon the state-of-the-art performance of bilingual word alignment for several language pairs across different word embeddings, and displays great robustness to dissimilarity between the language pairs or the training corpora of the two word embeddings.
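The paper does not specify its robust Procrustes variant or the CLIP-based fingerprints here, but the second step builds on a standard primitive: given a seed dictionary (the pivot), solve the orthogonal Procrustes problem in closed form via an SVD, re-match words under the estimated map, and iterate. The sketch below illustrates that generic refinement loop on synthetic embeddings; the function names (`procrustes`, `iterative_alignment`) and the synthetic setup are illustrative, not the authors' implementation.

```python
import numpy as np

def procrustes(X, Y):
    # Orthogonal Procrustes: W = argmin_{W: W^T W = I} ||X W - Y||_F,
    # solved in closed form from the SVD of X^T Y.
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

def iterative_alignment(X, Y, seed_pairs, n_iters=5):
    # seed_pairs: list of (src_idx, tgt_idx) word pairs forming the initial pivot.
    src, tgt = map(list, zip(*seed_pairs))
    W = procrustes(X[src], Y[tgt])
    for _ in range(n_iters):
        # Re-match every source word to its nearest target under the current map,
        # then re-estimate W from the induced dictionary.
        matches = ((X @ W) @ Y.T).argmax(axis=1)
        W = procrustes(X, Y[matches])
    return W, matches

# Demo on synthetic data: the target space is an exact rotation of the source,
# so the loop should recover the rotation and the identity matching.
rng = np.random.default_rng(0)
n, d = 50, 8
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # unit-norm "word embeddings"
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))    # ground-truth orthogonal map
Y = X @ Q
W, matches = iterative_alignment(X, Y, [(i, i) for i in range(d)])
```

On this noiseless example the recovered `W` coincides with the planted rotation `Q` and every word is matched to itself; the paper's contribution is making this loop robust when the two embedding spaces are not exact rotations of each other.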

Related research

- Word Alignment by Fine-tuning Embeddings on Parallel Corpora (01/20/2021): Word alignment over parallel corpora has a wide variety of applications,...
- Learning aligned embeddings for semi-supervised word translation using Maximum Mean Discrepancy (06/20/2020): Word translation is an integral part of language translation. In machine...
- Word-Alignment-Based Segment-Level Machine Translation Evaluation using Word Embeddings (04/02/2017): One of the most important problems in machine translation (MT) evaluatio...
- Globetrotter: Unsupervised Multilingual Translation from Visual Alignment (12/08/2020): Multi-language machine translation without parallel corpora is challengi...
- An Iterative Closest Point Method for Unsupervised Word Translation (01/18/2018): Unsupervised word translation from non-parallel inter-lingual corpora ha...
- Does mBERT understand Romansh? Evaluating word embeddings using word alignment (06/14/2023): We test similarity-based word alignment models (SimAlign and awesome-ali...
- Discovering Bilingual Lexicons in Polyglot Word Embeddings (08/31/2020): Bilingual lexicons and phrase tables are critical resources for modern M...
