Utilizing Wordnets for Cognate Detection among Indian Languages

12/30/2021
by Diptesh Kanojia, et al.

Automatic Cognate Detection (ACD) is a challenging task which has been utilized to help NLP applications like Machine Translation, Information Retrieval and Computational Phylogenetics. Unidentified cognate pairs can pose a challenge to these applications and result in a degradation of performance. In this paper, we detect cognate word pairs among ten Indian languages with Hindi and use deep learning methodologies to predict whether a word pair is cognate or not. We identify IndoWordnet as a potential resource to detect cognate word pairs based on orthographic similarity-based methods and train neural network models using the data obtained from it. We identify parallel corpora as another potential resource and perform the same experiments for them. We also validate the contribution of Wordnets through further experimentation and report improved performance of up to 26%. We discuss the nuances of cognate detection among closely related Indian languages and release the lists of detected cognates as a dataset. We also observe the behaviour of, to an extent, unrelated Indian language pairs and release the lists of detected cognates among them as well.
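
As a rough illustration of what an orthographic similarity-based filter over candidate word pairs could look like, the sketch below scores pairs with a normalized edit distance and keeps those above a cutoff. It is not the authors' pipeline: the abstract does not name the similarity measures or thresholds used, so the function names, the 0.7 threshold, and the romanized toy word pairs are all assumptions made for illustration.

```python
from typing import List, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalized_similarity(a: str, b: str) -> float:
    """Map edit distance to [0, 1]; 1.0 means the strings are identical."""
    if not a and not b:
        return 1.0
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))


def candidate_cognates(pairs: List[Tuple[str, str]],
                       threshold: float = 0.7) -> List[Tuple[str, str]]:
    """Keep pairs whose similarity meets the (assumed) threshold."""
    return [(s, t) for s, t in pairs if normalized_similarity(s, t) >= threshold]


# Hypothetical romanized word pairs, e.g. drawn from aligned Wordnet concepts.
toy_pairs = [("naam", "naav"), ("ghar", "kitab")]
selected = candidate_cognates(toy_pairs)
```

In a setup like the one the abstract describes, pairs selected this way would serve as training data for a neural classifier; a purely orthographic filter such as this one cannot separate true cognates from false friends on its own.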

research
12/17/2021

Challenge Dataset of Cognates and False Friend Pairs from Indian Languages

Cognates are present in multiple variants of the same text across differ...
research
12/16/2021

Harnessing Cross-lingual Features to Improve Cognate Detection for Low-resource Languages

Cognates are variants of the same lexical form across different language...
research
10/05/2020

A Generalized Constraint Approach to Bilingual Dictionary Induction for Low-Resource Language Families

The lack or absence of parallel and comparable corpora makes bilingual l...
research
08/09/2023

Information-Theoretic Characterization of Vowel Harmony: A Cross-Linguistic Study on Word Lists

We present a cross-linguistic study that aims to quantify vowel harmony ...
research
06/17/2022

The ITU Faroese Pairs Dataset

This article documents a dataset of sentence pairs between Faroese and D...
research
10/12/2022

SilverAlign: MT-Based Silver Data Algorithm For Evaluating Word Alignment

Word alignments are essential for a variety of NLP tasks. Therefore, cho...
research
08/26/2020

Tabular Structure Detection from Document Images for Resource Constrained Devices Using A Row Based Similarity Measure

Tabular structures are used to present crucial information in a structur...
