Unsupervised Separation of Transliterable and Native Words for Malayalam

03/26/2018
by Deepak P, et al.

Distinguishing native words from transliterable words is a key step in text processing tasks that involve more than one natural language. We consider the problem of unsupervised separation of transliterable words from native words in Malayalam text. Building on a key observation about the diversity of characters that follow the word stem, we develop an optimization method that scores words by their nativeness. The method maintains probability distributions over character n-grams and refines them in step with the nativeness scores within an iterative optimization formulation. In an empirical evaluation, our method, DTIM, provides significant improvements in nativeness scoring for Malayalam, establishing DTIM as the preferred method for the task.
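The abstract does not give the DTIM objective or its update rules, so the following is only a toy sketch of the general idea it describes: two character n-gram distributions (native and transliterable) are re-estimated from the current nativeness scores, and the scores are then refreshed from those distributions. Everything concrete here is an assumption made for illustration, not the authors' formulation: the function names, the stem length of 3, the bigram order, the smoothing constant, the sigmoid update, and the romanized placeholder words.

```python
import math
from collections import defaultdict


def char_ngrams(word, n=2):
    """Character n-grams of a word, padded with start/end markers."""
    padded = "^" + word + "$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def stem_diversity_init(words, stem_len=3):
    """Initial scores from the number of distinct characters that follow
    each stem (assumed here to be the first `stem_len` characters)."""
    followers = defaultdict(set)
    for w in words:
        if len(w) > stem_len:
            followers[w[:stem_len]].add(w[stem_len])
    raw = {w: len(followers.get(w[:stem_len], ())) for w in words}
    top = max(raw.values()) or 1
    # Map into [0.1, 0.9] so that no word starts with a hard label.
    return {w: 0.1 + 0.8 * raw[w] / top for w in words}


def estimate(words, weights, n, smoothing, vocab):
    """Weighted, additively smoothed n-gram distribution over `vocab`."""
    counts = defaultdict(float)
    for w in words:
        for g in char_ngrams(w, n):
            counts[g] += weights[w]
    total = sum(counts.values()) + smoothing * len(vocab)
    return {g: (counts[g] + smoothing) / total for g in vocab}


def score_nativeness(words, n=2, iterations=10, smoothing=0.1):
    """Alternate between re-estimating the two distributions and
    re-scoring each word; returns a score in [0, 1] per word."""
    words = list(dict.fromkeys(words))  # de-duplicate, keep order
    vocab = {g for w in words for g in char_ngrams(w, n)}
    scores = stem_diversity_init(words)
    for _ in range(iterations):
        native = estimate(words, scores, n, smoothing, vocab)
        inverse = {w: 1.0 - s for w, s in scores.items()}
        translit = estimate(words, inverse, n, smoothing, vocab)
        for w in words:
            grams = char_ngrams(w, n)
            # Average log-likelihood ratio, squashed with a sigmoid.
            llr = sum(math.log(native[g]) - math.log(translit[g])
                      for g in grams) / len(grams)
            scores[w] = 1.0 / (1.0 + math.exp(-llr))
    return scores


if __name__ == "__main__":
    # Romanized placeholder words, used only to exercise the code.
    toy = ["padikkunnu", "padikkum", "padichu", "padippu",
           "computer", "mobile", "school", "bus"]
    for word, score in sorted(score_nativeness(toy).items(),
                              key=lambda kv: -kv[1]):
        print(f"{word}\t{score:.3f}")
```

A higher score means the word looks more like the words currently treated as native. On a corpus this small the ranking is not meaningful; the sketch only shows how scores and n-gram distributions can be refined together.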

