Building a Word Segmenter for Sanskrit Overnight

02/17/2018
by Vikas Reddy, et al.

There is an abundance of digitised texts available in Sanskrit. However, word segmentation in such texts is challenging due to 'Sandhi'. In Sandhi, words in a sentence often fuse together to form a single chunk of text: the word delimiter vanishes and the sounds at the word boundaries undergo transformations, which are also reflected in the written text. Here, we propose an approach that uses a deep sequence-to-sequence (seq2seq) model that takes only the sandhied string as input and predicts the unsandhied string. The state-of-the-art models are linguistically involved and have external dependencies for the lexical and morphological analysis of the input. Our model can be trained "overnight" and used in production. In spite of the knowledge-lean approach, our system performs better than the current state of the art, with a percentage increase of 16.79.
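To make the input/output framing concrete, here is a minimal sketch of how sandhied/unsandhied string pairs could be turned into integer sequences for a seq2seq model. It is an illustration only, not the authors' pipeline: the character-level encoding, the special symbols, the three toy pairs, and the helper names (`build_vocab`, `encode`, `decode`) are all assumptions made for this example, and the abstract does not say how the model tokenises its input.

```python
import unicodedata

# Hypothetical toy data: (sandhied input, unsandhied target) pairs.
# The target restores the word delimiter and the original boundary sounds.
PAIRS = [
    ("nāsti", "na asti"),
    ("cāpi", "ca api"),
    ("tadeva", "tat eva"),
]

# Assumed special symbols for padding and sequence boundaries.
PAD, BOS, EOS = "<pad>", "<s>", "</s>"


def normalise(text):
    """Use one Unicode form so that 'ā' is always a single character."""
    return unicodedata.normalize("NFC", text)


def build_vocab(pairs):
    """Map every character seen in the data (plus specials) to an id."""
    chars = sorted({ch for src, tgt in pairs for ch in normalise(src + tgt)})
    return {sym: i for i, sym in enumerate([PAD, BOS, EOS] + chars)}


def encode(text, vocab):
    """Turn a string into a list of ids wrapped in BOS/EOS markers."""
    return [vocab[BOS]] + [vocab[ch] for ch in normalise(text)] + [vocab[EOS]]


def decode(ids, vocab):
    """Invert encode(), dropping the special symbols."""
    inverse = {i: sym for sym, i in vocab.items()}
    return "".join(inverse[i] for i in ids if inverse[i] not in (PAD, BOS, EOS))


vocab = build_vocab(PAIRS)
examples = [(encode(src, vocab), encode(tgt, vocab)) for src, tgt in PAIRS]
```

Each entry of `examples` pairs a source id sequence with a target id sequence. The space character appears only on the target side, so a model trained on such pairs has to learn both where the boundary falls and which sounds to restore.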


research
05/21/2018

MorphNet: A sequence-to-sequence model that combines morphological analysis and disambiguation

We introduce MorphNet, a single model that combines morphological analys...
research
10/21/2020

Controllable Text Simplification with Explicit Paraphrasing

Text Simplification improves the readability of sentences through severa...
research
09/06/2018

Upcycle Your OCR: Reusing OCRs for Post-OCR Text Correction in Romanised Sanskrit

We propose a post-OCR text correction approach for digitising texts in R...
research
10/24/2020

Neural Compound-Word (Sandhi) Generation and Splitting in Sanskrit Language

This paper describes neural network based approaches to the process of t...
research
11/13/2019

Mark my Word: A Sequence-to-Sequence Approach to Definition Modeling

Defining words in a textual context is a useful task both for practical ...
research
05/13/2020

Sanskrit Segmentation Revisited

Computationally analyzing Sanskrit texts requires proper segmentation in...
research
09/05/2018

Free as in Free Word Order: An Energy Based Model for Word Segmentation and Morphological Tagging in Sanskrit

The configurational information in sentences of a free word order langua...
