Using n-aksaras to model Sanskrit and Sanskrit-adjacent texts

01/30/2023
by Charles Li, et al.

Despite – or perhaps because of – their simplicity, n-grams, or contiguous sequences of tokens, have been used with great success in computational linguistics since their introduction in the late 20th century. Recast as k-mers, or contiguous sequences of monomers, they have also found applications in computational biology. When applied to the analysis of texts, n-grams usually take the form of sequences of words. But if we try to apply this model to the analysis of Sanskrit texts, we are faced with the arduous task of, firstly, resolving sandhi to split a phrase into words, and, secondly, splitting long compounds into their components. This paper presents a simpler method of tokenizing a Sanskrit text for n-grams, by using n-aksaras, or contiguous sequences of aksaras. This model reduces the need for sandhi resolution, making it much easier to use on raw text. It is also possible to use this model on Sanskrit-adjacent texts, e.g., a Tamil commentary on a Sanskrit text. As a test case, the commentaries on Amarakosa 1.0.1 have been modelled as n-aksaras, showing patterns of text reuse across ten centuries and nine languages. Some initial observations are made concerning Buddhist commentarial practices.
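
To make the idea of n-aksaras concrete, here is a minimal Python sketch of how a romanized Sanskrit string might be split into aksaras and then grouped into contiguous sequences. It is an illustration under stated assumptions, not the paper's implementation: it assumes IAST transliteration, treats an aksara as an optional consonant cluster followed by one vowel and an optional anusvara or visarga, and drops whitespace and punctuation so that aksaras can span word boundaries. The function names `aksaras` and `n_aksaras` are hypothetical.

```python
import re
import unicodedata

# Assumption: input is IAST-romanized Sanskrit. Normalizing to NFC makes each
# vowel with a diacritic a single code point, so character classes work.
_VOWELS = unicodedata.normalize("NFC", "aāiīuūṛṝḷeo")
_MARKS = unicodedata.normalize("NFC", "ṃḥ")  # anusvara, visarga

# One aksara = consonant cluster (any letters that are not vowels or marks),
# then one vowel or diphthong, then an optional anusvara/visarga.
# The second alternative catches leftover letters with no vowel,
# e.g. a text-final consonant.
_AKSARA = re.compile(
    rf"[^\W\d_{_VOWELS}{_MARKS}]*(?:ai|au|[{_VOWELS}])[{_MARKS}]?|[^\W\d_]+"
)


def aksaras(text):
    """Split an IAST string into aksaras, ignoring spaces and punctuation."""
    text = unicodedata.normalize("NFC", text).lower()
    # Remove everything that is not a letter, so a word-final consonant
    # joins the first aksara of the following word.
    text = re.sub(r"[\W\d_]+", "", text)
    return _AKSARA.findall(text)


def n_aksaras(text, n):
    """Return all contiguous sequences of n aksaras as tuples."""
    units = aksaras(text)
    return [tuple(units[i:i + n]) for i in range(len(units) - n + 1)]


if __name__ == "__main__":
    print(aksaras("tat tvam asi"))
    print(n_aksaras("tat tvam asi", 2))
```

Because spaces are discarded before segmentation, `tat tvam asi` and `tattvamasi` yield the same aksaras, which is one way the need for word splitting can be sidestepped. A known limitation of this sketch is that a word-final `a` followed by a word-initial `i` or `u` would be read as a diphthong once the space is removed.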

