Challenge Dataset of Cognates and False Friend Pairs from Indian Languages

12/17/2021
by Diptesh Kanojia, et al.

Cognates are present in multiple variants of the same text across different languages (e.g., "hund" in German and "hound" in English both mean "dog"). They pose a challenge to various Natural Language Processing (NLP) applications such as Machine Translation, Cross-lingual Sense Disambiguation, Computational Phylogenetics, and Information Retrieval. A possible solution to this challenge is to identify cognates across language pairs. In this paper, we describe the creation of two cognate datasets for twelve Indian languages, namely Sanskrit, Hindi, Assamese, Oriya, Kannada, Gujarati, Tamil, Telugu, Punjabi, Bengali, Marathi, and Malayalam. We digitize the cognate data from an Indian language cognate dictionary and utilize linked Indian language Wordnets to generate cognate sets. Additionally, we use the Wordnet data to create a false friends dataset for eleven language pairs. We evaluate the efficacy of our dataset using previously available baseline cognate detection approaches, perform a manual evaluation with the help of lexicographers, and release the curated gold-standard dataset with this paper.
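As an illustration of what identifying cognates across a language pair can look like, the sketch below scores a word pair by normalized edit distance, a common orthographic-similarity baseline. It is not necessarily one of the baseline approaches evaluated in the paper; the function names and the 0.5 threshold are assumptions made for this example only.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance where insertion, deletion, and substitution each cost 1."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def normalized_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]: 1.0 for identical strings, 0.0 for no overlap."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def is_cognate_candidate(a: str, b: str, threshold: float = 0.5) -> bool:
    """Flag a pair as a cognate candidate (threshold is an illustrative assumption)."""
    return normalized_similarity(a, b) >= threshold


# The example pair from the abstract: one insertion separates the two words.
print(levenshtein("hund", "hound"))
print(normalized_similarity("hund", "hound"))
print(is_cognate_candidate("hund", "hound"))
```

A purely orthographic score like this cannot separate cognates from false friends, since both look alike on the surface; telling them apart requires meaning information, such as the linked Wordnet data the abstract describes.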

