Unsupervised Separation of Native and Loanwords for Malayalam and Telugu

02/12/2020
by Sridhama Prakhya, et al.

Words from one language are often adopted into another without translation; they appear in transliterated form in text written in the borrowing language. This phenomenon is particularly widespread in Indian languages, where many words are loaned from English. In this paper, we address the task of identifying loanwords automatically, in an unsupervised manner, from large datasets of words from agglutinative Dravidian languages. We target two languages from the Dravidian family: Malayalam and Telugu. Based on familiarity with the languages, we observe that native words in both languages tend to have a much more versatile stem than words loaned from other languages, where "stem" is shorthand for the subword sequence formed by the first few characters of the word. We harness this observation to build an objective function and an iterative optimization formulation for it, which yields a nativeness score for each word. Through an extensive empirical analysis over real-world datasets from both Malayalam and Telugu, we show that our method quantifies nativeness more effectively than the available baselines for the task.
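To make the stem-versatility observation concrete, here is a minimal sketch. It is not the paper's objective function or its iterative optimization; it only counts, for each word, how many distinct continuations follow that word's fixed-length prefix in a word list. The function name `stem_versatility`, the parameter `stem_len`, and the toy Latin-script strings are all assumptions made for illustration, not real Malayalam or Telugu data.

```python
from collections import defaultdict


def stem_versatility(words, stem_len=3):
    """Count distinct continuations observed after each word's stem.

    The stem is taken to be the first `stem_len` characters (an assumed,
    fixed length; the abstract only says "the first few characters").
    """
    vocabulary = {w for w in words if len(w) > stem_len}
    continuations = defaultdict(set)
    for word in vocabulary:
        continuations[word[:stem_len]].add(word[stem_len:])
    # A higher count means the stem combines with more suffixes.
    return {word: len(continuations[word[:stem_len]]) for word in vocabulary}


# Toy input: "abc" takes three different continuations, "qrs" only one.
words = ["abcx", "abcyy", "abczzz", "qrstuv"]
scores = stem_versatility(words)
```

Under the abstract's hypothesis, words whose stems score high in such a count would be more likely native, and words with rarely reused stems more likely loaned. A raw count like this is only a crude proxy for the nativeness score the paper optimizes for.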

