Isomorphic Cross-lingual Embeddings for Low-Resource Languages

03/28/2022
by Sonal Sannigrahi, et al.

Cross-Lingual Word Embeddings (CLWEs) are a key component for transferring linguistic information learnt in higher-resource settings to lower-resource ones. Recent research in cross-lingual representation learning has focused on offline mapping approaches due to their simplicity, computational efficacy, and ability to work with minimal parallel resources. However, these approaches crucially depend on the assumption that the embedding spaces are approximately isomorphic, i.e., that they share a similar geometric structure. This assumption does not hold in practice, leading to poorer performance on low-resource and distant language pairs. In this paper, we introduce a framework to learn CLWEs, without assuming isometry, for low-resource pairs via joint exploitation of a related higher-resource language. We first pre-align the low-resource and related-language embedding spaces using offline methods to mitigate the assumption of isometry. Following this, we use joint training methods to develop CLWEs for the related language and the target embedding space. Finally, we remap the pre-aligned low-resource space and the target space to generate the final CLWEs. We show consistent gains over current methods in both quality and degree of isomorphism, as measured by bilingual lexicon induction (BLI) and eigenvalue similarity respectively, across several language pairs: Nepali-, Finnish-, Romanian-, Gujarati-, and Hungarian-English. Lastly, our analysis points to the relatedness of the related language, as well as the amount of related-language data available, as key factors in determining the quality of the embeddings achieved.
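
To make the notion of an offline mapping concrete, the sketch below shows one common way such a mapping is computed: an orthogonal Procrustes solution over a seed dictionary. This is an illustrative assumption, not necessarily the mapping method used in the paper; the function name `procrustes_map` and the synthetic data are hypothetical.

```python
import numpy as np


def procrustes_map(X, Y):
    """Return the orthogonal matrix W that minimizes ||X @ W - Y||_F.

    X and Y hold embeddings of seed-dictionary word pairs, one pair per row.
    (Assumption: orthogonal Procrustes as the offline mapping step.)
    """
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt


# Synthetic example: the "source" space is an exact rotation of the "target"
# space, i.e. the two spaces are perfectly isometric by construction.
rng = np.random.default_rng(0)
d = 5
Y = rng.normal(size=(50, d))                  # target-space vectors
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))  # random orthogonal matrix
X = Y @ Q.T                                   # source-space vectors, so X @ Q == Y

W = procrustes_map(X, Y)
residual = np.linalg.norm(X @ W - Y)          # close to zero in this isometric case
```

With real embeddings the two spaces are not exact rotations of each other, so the residual stays well above zero; that gap is the failure of the isometry assumption the abstract refers to.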

research
10/23/2020

Anchor-based Bilingual Word Embeddings for Low-Resource Languages

Bilingual word embeddings (BWEs) are useful for many cross-lingual appli...
research
04/28/2020

LNMap: Departures from Isomorphic Assumption in Bilingual Lexicon Induction Through Non-Linear Mapping in Latent Space

Most of the successful and predominant methods for bilingual lexicon ind...
research
05/17/2020

Cross-Lingual Word Embeddings for Turkic Languages

There has been an increasing interest in learning cross-lingual word emb...
research
12/22/2018

Exploiting Cross-Lingual Subword Similarities in Low-Resource Document Classification

Text classification must sometimes be applied in situations with no trai...
research
11/15/2022

SexWEs: Domain-Aware Word Embeddings via Cross-lingual Semantic Specialisation for Chinese Sexism Detection in Social Media

The goal of sexism detection is to mitigate negative online content targ...
research
08/19/2019

Bilingual Lexicon Induction with Semi-supervision in Non-Isometric Embedding Spaces

Recent work on bilingual lexicon induction (BLI) has frequently depended...
research
08/19/2022

Effective Transfer Learning for Low-Resource Natural Language Understanding

Natural language understanding (NLU) is the task of semantic decoding of...