Low-resource Bilingual Dialect Lexicon Induction with Large Language Models

04/19/2023
by Ekaterina Artemova, et al.

Bilingual word lexicons are crucial tools for multilingual natural language understanding and machine translation, because they map words in one language to their translation equivalents in another. Numerous papers have explored bilingual lexicon induction (BLI) in high-resource scenarios, using a typical pipeline of two unsupervised steps, bitext mining and word alignment, both of which rely on pre-trained large language models (LLMs). In this paper, we present an analysis of the BLI pipeline for German and two of its dialects, Bavarian and Alemannic. This setup poses several unique challenges, including the scarcity of resources, the relatedness of the languages, and the lack of standardized orthography in the dialects. To evaluate the BLI outputs, we analyze them with respect to word frequency and pairwise edit distance. Additionally, we release two evaluation datasets comprising 1,500 bilingual sentence pairs and 1,000 bilingual word pairs. Both were manually judged for semantic similarity for the Bavarian-German and Alemannic-German language pairs.
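To make the pairwise edit distance analysis concrete, the following is a minimal sketch that assumes plain Levenshtein distance (unit cost for insertion, deletion, and substitution) and a length-normalized variant. The abstract does not state which edit distance or normalization is used, so the function names `levenshtein` and `normalized_distance` and the example word pairs are illustrative assumptions, not taken from the released datasets.

```python
def levenshtein(a: str, b: str) -> int:
    """Levenshtein distance with unit costs, computed row by row."""
    # Keep the shorter string in the inner loop to bound memory use.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,                      # deletion
                    current[j - 1] + 1,                   # insertion
                    previous[j - 1] + substitution_cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def normalized_distance(a: str, b: str) -> float:
    """Edit distance divided by the longer word's length (an assumed normalization)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


# Hypothetical dialect-German candidate pairs, for illustration only.
candidate_pairs = [("Haus", "Haus"), ("i", "ich"), ("Madl", "Mädchen")]

for dialect_word, german_word in candidate_pairs:
    distance = levenshtein(dialect_word, german_word)
    ratio = normalized_distance(dialect_word, german_word)
    print(f"{dialect_word} -> {german_word}: distance={distance}, normalized={ratio:.2f}")
```

Because the dialects are closely related to German, many correct pairs are likely to sit at small distances, so a score like this is more useful for characterizing induced pairs than for judging their correctness on its own.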


