Discovering Bilingual Lexicons in Polyglot Word Embeddings

08/31/2020
by   Ashiqur R. KhudaBukhsh, et al.

Bilingual lexicons and phrase tables are critical resources for modern Machine Translation systems. Although recent results show that highly accurate bilingual lexicons can be learned without any seed lexicon or parallel data using unsupervised methods, such methods rely on the existence of large, clean monolingual corpora. In this work, we utilize a single Skip-gram model trained on a multilingual corpus, yielding polyglot word embeddings, and present a novel finding: a surprisingly simple constrained nearest-neighbor sampling technique in this embedding space can retrieve bilingual lexicons, even in harsh social media data sets predominantly written in English and Romanized Hindi and often exhibiting code-switching. Our method requires neither monolingual corpora nor seed lexicons, nor any other such resources. Additionally, across three European language pairs, we observe that polyglot word embeddings indeed learn a rich semantic representation of words, and that substantial bilingual lexicons can be retrieved using our constrained nearest-neighbor sampling. We investigate potential reasons and downstream applications in settings spanning both clean texts and noisy social media data sets, and both resource-rich and under-resourced language pairs.
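The core idea can be illustrated with a minimal sketch: given polyglot embeddings (in the paper, from a single Skip-gram model trained on the mixed-language corpus) and some way to tag each vocabulary word with a language, a translation candidate for a query word is simply its nearest neighbor among words of the *other* language. The vectors and language tags below are toy, hand-picked assumptions purely for illustration; they are not from the paper.

```python
import numpy as np

# Toy polyglot embedding table (hypothetical vectors; in the paper these
# would come from one Skip-gram model trained on the multilingual corpus).
embeddings = {
    "dog":   np.array([0.90, 0.10, 0.00]),
    "cat":   np.array([0.80, 0.30, 0.10]),
    "house": np.array([0.12, 0.88, 0.38]),
    "kutta": np.array([0.88, 0.12, 0.02]),  # Romanized Hindi for "dog"
    "billi": np.array([0.78, 0.32, 0.08]),  # Romanized Hindi for "cat"
    "ghar":  np.array([0.10, 0.90, 0.40]),  # Romanized Hindi for "house"
}

# Hypothetical language tags; a real system would need a language
# identifier or wordlist membership to supply this constraint.
language = {
    "dog": "en", "cat": "en", "house": "en",
    "kutta": "hi", "billi": "hi", "ghar": "hi",
}

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def constrained_nearest_neighbors(query, target_lang, k=1):
    """Nearest neighbors of `query`, restricted to words tagged `target_lang`."""
    q = embeddings[query]
    candidates = [w for w in embeddings
                  if language[w] == target_lang and w != query]
    candidates.sort(key=lambda w: cosine(q, embeddings[w]), reverse=True)
    return candidates[:k]

print(constrained_nearest_neighbors("dog", "hi"))   # → ['kutta']
print(constrained_nearest_neighbors("ghar", "en"))  # → ['house']
```

The language constraint is what makes this work on code-switched text: an unconstrained nearest-neighbor query would mostly return same-language synonyms, whereas restricting candidates to the other language surfaces translation pairs.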
