Harnessing Cross-lingual Features to Improve Cognate Detection for Low-resource Languages

12/16/2021
by Diptesh Kanojia, et al.

Cognates are variants of the same lexical form across different languages; for example, 'fonema' in Spanish and 'phoneme' in English are cognates, both meaning 'a unit of sound'. Automatic detection of cognates between any two languages can help downstream NLP tasks such as Cross-lingual Information Retrieval, Computational Phylogenetics, and Machine Translation. In this paper, we demonstrate the use of cross-lingual word embeddings for detecting cognates among fourteen Indian languages. Our approach introduces the use of context from a knowledge graph to generate improved feature representations for cognate detection. We then evaluate the impact of our cognate detection mechanism on neural machine translation (NMT) as a downstream task. We evaluate our cognate detection methods on a challenging dataset of twelve Indian languages, namely Sanskrit, Hindi, Assamese, Oriya, Kannada, Gujarati, Tamil, Telugu, Punjabi, Bengali, Marathi, and Malayalam. Additionally, we create evaluation datasets for two more Indian languages, Konkani and Nepali. We observe an improvement of up to 18 F-score points for cognate detection. Furthermore, we observe that cognates extracted using our method help improve NMT quality by up to 2.76 BLEU points. We also publicly release our code, newly constructed datasets, and cross-lingual models.
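As a rough illustration of the general idea of pairing surface-form similarity with embedding similarity, the sketch below computes two features for a candidate word pair: a normalized edit-distance score and the cosine similarity of two word vectors. It is not the authors' method, which relies on cross-lingual embeddings and knowledge-graph context; the function names and the toy vectors are assumptions made up for this example.

```python
import math


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def orthographic_similarity(a: str, b: str) -> float:
    """Edit distance rescaled to [0, 1]; 1.0 means identical strings."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def cognate_features(word_a: str, word_b: str,
                     vec_a: list[float], vec_b: list[float]) -> dict[str, float]:
    """Hypothetical feature set for one candidate cognate pair."""
    return {
        "orthographic": orthographic_similarity(word_a, word_b),
        "semantic": cosine(vec_a, vec_b),
    }


# Toy vectors, invented for illustration only; real vectors would come
# from a shared cross-lingual embedding space.
features = cognate_features("fonema", "phoneme", [0.9, 0.1, 0.3], [0.8, 0.2, 0.4])
```

Features of this kind would typically be passed to a classifier trained on labeled cognate pairs; which classifier and which additional features to use is left open here.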


