When Word Embeddings Become Endangered

03/24/2021
by Khalid Alnajjar, et al.

Big languages such as English and Finnish have many natural language processing (NLP) resources and models. This is not the case for low-resourced and endangered languages, for which such resources remain scarce despite the great advantages they would provide for the language communities. The most common types of resources available for low-resourced and endangered languages are translation dictionaries and Universal Dependencies treebanks. In this paper, we present a method for constructing word embeddings for endangered languages using existing word embeddings of different resource-rich languages and the translation dictionaries of resource-poor languages. The embeddings are then fine-tuned on the sentences in the Universal Dependencies treebanks and aligned to match the semantic spaces of the big languages, resulting in cross-lingual embeddings. The endangered languages we work with here are Erzya, Moksha, Komi-Zyrian and Skolt Sami. Furthermore, we build a universal sentiment analysis model for all the languages that are part of this study, whether endangered or not, by utilizing the cross-lingual word embeddings. Our evaluation shows that the word embeddings for the endangered languages are well aligned with those of the resource-rich languages, and that they are suitable for training task-specific models, as demonstrated by our sentiment analysis model, which achieved high accuracy. All our cross-lingual word embeddings and the sentiment analysis model have been released openly via an easy-to-use Python library.
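The dictionary-based construction step described in the abstract can be illustrated with a small sketch. The abstract does not say how the translations of a word are combined, so averaging the donor-language vectors is an assumption made here for illustration only. The function names (`project_word`, `build_embeddings`), the toy vectors and the placeholder words are all hypothetical and are not taken from the paper or its released library.

```python
from typing import Dict, List, Optional

Vector = List[float]


def project_word(translations: List[str],
                 donor_vectors: Dict[str, Vector]) -> Optional[Vector]:
    """Average the donor-language vectors of a word's dictionary translations.

    Averaging is an assumption of this sketch, not a detail from the paper.
    """
    found = [donor_vectors[t] for t in translations if t in donor_vectors]
    if not found:
        return None  # none of the translations has a known vector
    dim = len(found[0])
    return [sum(vec[i] for vec in found) / len(found) for i in range(dim)]


def build_embeddings(dictionary: Dict[str, List[str]],
                     donor_vectors: Dict[str, Vector]) -> Dict[str, Vector]:
    """Build vectors for resource-poor words from a translation dictionary."""
    embeddings: Dict[str, Vector] = {}
    for word, translations in dictionary.items():
        vec = project_word(translations, donor_vectors)
        if vec is not None:
            embeddings[word] = vec
    return embeddings


# Toy 2-dimensional vectors for a resource-rich language (made-up values).
donor_vectors = {
    "house": [1.0, 0.0],
    "home": [0.0, 1.0],
    "water": [2.0, 2.0],
}

# Hypothetical dictionary: placeholder source words mapped to translations.
dictionary = {
    "w_house": ["house", "home"],
    "w_water": ["water"],
    "w_unknown": ["missing"],  # translation has no donor vector, so it is skipped
}

embeddings = build_embeddings(dictionary, donor_vectors)
print(embeddings)
```

Words whose translations have no vector in the donor language are simply left out in this sketch. The fine-tuning and alignment steps mentioned in the abstract are not covered here.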

research
05/17/2020

Cross-Lingual Word Embeddings for Turkic Languages

There has been an increasing interest in learning cross-lingual word emb...
research
07/09/2018

Predicting Concreteness and Imageability of Words Within and Across Languages via Word Embeddings

The notions of concreteness and imageability, traditionally important in...
research
05/23/2018

Bilingual Sentiment Embeddings: Joint Projection of Sentiment Across Languages

Sentiment analysis in low-resource languages suffers from a lack of anno...
research
04/10/2023

On Evaluation of Bangla Word Analogies

This paper presents a high-quality dataset for evaluating the quality of...
research
10/04/2018

Neural Networks for Cross-lingual Negation Scope Detection

Negation scope has been annotated in several English and Chinese corpora...
research
10/26/2022

Sinhala Sentence Embedding: A Two-Tiered Structure for Low-Resource Languages

In the process of numerically modeling natural languages, developing lan...
research
01/11/2021

Clustering Word Embeddings with Self-Organizing Maps. Application on LaRoSeDa – A Large Romanian Sentiment Data Set

Romanian is one of the understudied languages in computational linguisti...
