A Simple Method for Unsupervised Bilingual Lexicon Induction for Data-Imbalanced, Closely Related Language Pairs

05/23/2023
by Niyati Bafna, et al.

Existing approaches to unsupervised bilingual lexicon induction (BLI) often depend on good-quality static or contextual embeddings trained on large monolingual corpora for both languages. In practice, however, unsupervised BLI is most likely to be useful for dialects and languages that lack abundant monolingual data. We introduce a simple and fast method for unsupervised BLI for low-resource languages that have a related mid-to-high-resource language; it requires only inference on a monolingual BERT for the higher-resource language. We work with two low-resource languages (<5M monolingual tokens), Bhojpuri and Magahi, of the severely under-researched Indic dialect continuum, and show that state-of-the-art methods from the literature perform near zero in these settings, whereas our simpler method gives much better results. We repeat our experiments on Marathi and Nepali, two higher-resource Indic languages, to compare how the approaches perform across resource ranges. For the first time, we release automatically created bilingual lexicons for five languages of the Indic dialect continuum.


research
05/31/2021

Adapting High-resource NMT Models to Translate Low-resource Related Languages without Parallel Data

The scarcity of parallel data is a major obstacle for training high-qual...
research
01/31/2020

Unsupervised Bilingual Lexicon Induction Across Writing Systems

Recent embedding-based methods in unsupervised bilingual lexicon inducti...
research
09/23/2020

Harnessing Multilinguality in Unsupervised Machine Translation for Rare Languages

Unsupervised translation has reached impressive performance on resource-...
research
10/05/2020

A Generalized Constraint Approach to Bilingual Dictionary Induction for Low-Resource Language Families

The lack or absence of parallel and comparable corpora makes bilingual l...
research
08/31/2018

Generalizing Procrustes Analysis for Better Bilingual Dictionary Induction

Most recent approaches to bilingual dictionary induction find a linear a...
research
03/04/2016

A Bayesian Model of Multilingual Unsupervised Semantic Role Induction

We propose a Bayesian model of unsupervised semantic role induction in m...
research
04/28/2020

LNMap: Departures from Isomorphic Assumption in Bilingual Lexicon Induction Through Non-Linear Mapping in Latent Space

Most of the successful and predominant methods for bilingual lexicon ind...
