Sanskrit Sandhi Splitting using seq2(seq)^2

01/01/2018 ∙ by Neelamadhav Gantayat, et al. ∙ ibm 0

In Sanskrit, small words (morphemes) are combined through a morphophonological process called Sandhi to form compound words. Sandhi splitting is the process of splitting a given compound word into its constituent morphemes. Although rules governing the splitting of words exist, it is highly challenging to identify the location of the splits in a compound word, as the same compound word might be broken down in multiple ways to provide syntactically correct splits. Existing systems explore incorporating these pre-defined splitting rules, but have low accuracy since they don't address the fundamental problem of identifying the split location. With this work, we propose a novel Double Decoder RNN (DD-RNN) architecture which i) predicts the location of the split(s) with an accuracy of 95% and ii) predicts the constituent words (i.e. learning the Sandhi splitting rules) with an accuracy of 79.5%. To the best of our knowledge, deep learning techniques have never been applied to the Sandhi splitting problem before. We further demonstrate that our model significantly outperforms the previous state of the art.




1 Introduction

Compound word formation in Sanskrit is governed by a set of deterministic rules following a well-defined structure described in Pāṇini’s Aṣṭādhyāyī, a seminal work on Sanskrit grammar. The process of merging two or more morphemes to form a word in Sanskrit is called Sandhi, and the process of breaking a compound word into its constituent morphemes is called Sandhi splitting. In Japanese, Rendaku (‘sequential voicing’) is similar to Sandhi. For example, ‘origami’ consists of ‘ori’ (folding) + ‘kami’ (paper), where ‘kami’ changes to ‘gami’ due to Rendaku.

Learning the process of Sandhi splitting for Sanskrit could provide linguistic insights into the formation of words in a wide variety of Dravidian languages. From an NLP perspective, automated learning of word formations in Sanskrit could also provide a framework for learning word organization in other Indian languages Bharati et al. (2006). In the literature, past works Gillon (2009), Kulkarni and Shukl (2009) have treated Sandhi splitting as a rule-based problem, applying the rules from the Aṣṭādhyāyī in a brute-force manner.

Figure 1: Different possible splits for the words paropakāraḥ and protsāhaḥ, provided by a standard Sandhi splitter.

Consider the example in Figure 1, which illustrates the different possible splits of the compound word paropakāraḥ. While the correct split is para + upakāraḥ, other splits such as para + apa + kāraḥ are syntactically possible but semantically incorrect (the different syntactic splits shown are those given by one of the popular Sandhi splitters). Thus, knowing all the rules of splitting is insufficient; it is essential to identify the location(s) of the split(s) in a given compound word.

In this research, we propose an approach for automated generation of split words by first learning the potential split locations in a compound word. We use a deep bi-directional character RNN encoder and two decoders with attention, seq2(seq)^2. The accuracy of our approach on the benchmark dataset is 95% for split location prediction and 79.5% for split word prediction. To the best of our knowledge, this is the first research work to explore deep learning techniques for the problem of Sanskrit Sandhi splitting, and it produces state-of-the-art results. Additionally, we report the performance of our proposed model on Chinese word segmentation to demonstrate the model’s generalization capability.

2 Model Description

In this section, we present our double decoder model to address the Sandhi splitting problem. We first outline the issues with basic deep learning architectures and conceptually highlight the advantages of the double decoder model.

2.1 Issues with standard architectures

Consider an example of splitting a sequence abcdefg as abcdx + efg. The primary task is to identify d as the split location. Further, for a given location d in the character sequence, the algorithm should take into account (i) the context of character sequence abc, (ii) the immediate previous character c, (iii) the immediate succeeding character e, to make an effective split. For such sequence learning problems, RNNs have become the most popular deep learning model Pascanu et al. (2013) Sak et al. (2014).

We initially trained a basic RNN encoder-decoder model Cho et al. (2014) with LSTM units Hochreiter and Schmidhuber (1997), similar to a machine translation model. The compound word’s characters are fed as input to the encoder and are translated to a sequence of characters representing the split words (the ‘+’ symbol acts as a separator between the generated split words). However, the model did not yield adequate performance, as it encoded only the context of the characters that appeared before the potential split location(s). Making the encoder bi-directional (referred to as B-RNN) improved the model’s performance only marginally. Adding global attention (referred to as B-RNN-A) to the decoder enabled the model to attend to the characters surrounding the potential split location(s) and improved the split prediction performance, making it comparable with some of the best performing tools in the literature.

2.2 Double Decoder RNN (DD-RNN) model

The critical part of learning to split compound words is to correctly identify the location(s) of the split(s). Therefore, we added two decoders to our bi-directional encoder-decoder model: (i) a location decoder, which learns to predict the split locations, and (ii) a character decoder, which generates the split words. A compound word is fed into the encoder character by character. Each character’s embedding is passed to the encoder’s LSTM units. There are two LSTM layers which encode the word, one in the forward direction and the other in the backward direction. The encoded context vector is then passed to a global attention layer.

In the first phase of training, only the location decoder is trained and the character decoder is frozen. The character embeddings are learned from scratch in this phase along with the attention weights and other parameters. Here, the model learns to identify the split locations. For example, if the inputs are the embeddings for the compound word protsāhaḥ, the location decoder will generate a binary vector which indicates that the split occurs between the third and fourth characters. In the second phase, the location decoder is frozen and the character decoder is trained. The encoder and attention weights are allowed to be fine-tuned. This decoder learns the underlying rules of Sandhi splitting. Since the attention layer is already pre-trained to identify potential split locations in the previous phase, the character decoder can use this context and learn to split the words more accurately. For example, for the same input word protsāhaḥ, the character decoder will generate [p, r, a, +, u, t, s, ā, h, a, ḥ] as the output. Here the character o is split into two characters a and u.
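The two kinds of training targets described above can be illustrated with a small sketch. The helper names `location_target` and `character_target` are hypothetical, and the exact encoding of the binary vector (a 1 at the character where the split falls) is an assumption; the text only states that the vector marks the split location.

```python
import unicodedata


def location_target(compound: str, split_indices: list[int]) -> list[int]:
    """Binary vector with a 1 at every 0-based character index where a split occurs.

    Assumption: the split is marked on the character that gets rewritten
    (the 'o' in protsāhaḥ); the source does not spell out the encoding.
    """
    chars = unicodedata.normalize("NFC", compound)
    marks = set(split_indices)
    return [1 if i in marks else 0 for i in range(len(chars))]


def character_target(parts: list[str]) -> list[str]:
    """Character sequence of the split words, with '+' as the separator."""
    return list(unicodedata.normalize("NFC", "+".join(parts)))


if __name__ == "__main__":
    # Phase 1 target: where the split happens (third character, index 2).
    print(location_target("protsāhaḥ", [2]))
    # Phase 2 target: the split words themselves, 'o' rewritten as 'a' + 'u'.
    print(character_target(["pra", "utsāhaḥ"]))
```

The same input word therefore yields a length-9 binary target for the location decoder and a length-11 character target for the character decoder.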

Figure 2: The bi-directional encoder and decoders with attention

In both the training phases, we use negative log likelihood as the loss function. Let $X = (x_1, \dots, x_n)$ be the sequence of the input compound word’s characters and $Y$ be the binary vector which indicates the location of the split(s) in the first phase and the true target sequence of characters which form the split words in the second phase. If $Z$ is the set of training pairs $(X, Y)$, then the loss function is defined as:

$$\mathcal{L} = -\sum_{(X, Y) \in Z} \log P(Y \mid X)$$

We evaluate the DD-RNN and compare it with other tools and architectures in Section 4.
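As a rough illustration of the negative log likelihood used in both training phases, the sketch below sums the negative log probability that a model assigns to each target symbol. The function names and the dictionary-based representation of the per-step distributions are assumptions made for this example, not the paper's implementation.

```python
import math


def sequence_nll(step_probs, target):
    """Negative log likelihood of one target sequence.

    step_probs[t] maps every candidate symbol to the probability the
    (hypothetical) decoder assigns to it at output step t.
    """
    return -sum(math.log(step_probs[t][symbol]) for t, symbol in enumerate(target))


def dataset_nll(examples):
    """Sum of sequence_nll over (step_probs, target) pairs."""
    return sum(sequence_nll(probs, target) for probs, target in examples)


if __name__ == "__main__":
    # Two decoding steps of a location decoder over the labels "0" and "1".
    probs = [{"0": 0.9, "1": 0.1}, {"0": 0.2, "1": 0.8}]
    print(sequence_nll(probs, ["0", "1"]))  # equals -log(0.9 * 0.8)
```

The same function applies to either decoder: only the symbol inventory changes (binary labels in the first phase, characters and ‘+’ in the second).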

2.3 Implementation details

The architecture of the DD-RNN is shown in Figure 2. We used a character embedding size of . The bi-directional encoder and the two decoders are layers deep with LSTM units in each layer. A dropout layer with is applied after each LSTM layer. The entire network is implemented in Torch.

Of the words in our benchmark dataset, we randomly sampled of the data for training our models. The remaining was used for testing. We used stochastic gradient descent for optimizing the model parameters with an initial learning rate of . The learning rate was decayed by a factor of if the validation perplexity did not improve after an epoch. We used a batch size of and trained the network for epochs on four Tesla K80 GPUs. This setup remains the same for all the experiments we conduct.
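The learning rate schedule described above can be sketched as follows. The initial rate, the decay factor, and the perplexity values in the example are placeholders chosen for illustration, not the paper's settings, and comparing against the best perplexity seen so far (rather than the previous epoch) is an assumption.

```python
def decay_on_plateau(initial_lr, val_perplexities, factor):
    """Return the learning rate in effect after each epoch.

    The rate is multiplied by `factor` whenever an epoch's validation
    perplexity fails to improve on the best value seen so far.
    """
    lr = initial_lr
    best = float("inf")
    schedule = []
    for perplexity in val_perplexities:
        if perplexity < best:
            best = perplexity  # improvement: keep the current rate
        else:
            lr *= factor  # no improvement: decay the rate
        schedule.append(lr)
    return schedule


if __name__ == "__main__":
    # Placeholder values: start at 1.0 and halve on every plateau.
    print(decay_on_plateau(1.0, [10.0, 8.0, 8.5, 7.0, 7.2], 0.5))
```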

3 Existing Datasets and Tools

In this section, we briefly introduce various Sanskrit Sandhi datasets and splitting tools available in the literature. We also discuss the tools’ drawbacks and the major challenges faced while creating such tools.

Datasets: The UoH corpus, created at the University of Hyderabad, contains words and their splits. This dataset is noisy, with typing errors and incorrect splits. The recent SandhiKosh corpus Shubham Bhardwaj (2018) is a set of annotated splits. We combine these datasets and heuristically prune them to obtain the final set of words and their splits. The pruning is done by considering a data point to be valid only if the compound word and its splits are present in a standard Sanskrit dictionary Monier-Williams (1970). We use this as our benchmark dataset and run all our experiments on it.
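The pruning heuristic can be sketched as a simple dictionary membership filter. The function name `prune`, the data layout, and the tiny dictionary below are hypothetical; a real run would load the entries of the dictionary cited above.

```python
def prune(dataset, dictionary):
    """Keep a data point only if the compound word and all of its splits
    are present in the dictionary."""
    return [
        (compound, parts)
        for compound, parts in dataset
        if compound in dictionary and all(part in dictionary for part in parts)
    ]


if __name__ == "__main__":
    # Toy stand-in for a Sanskrit dictionary.
    dictionary = {"paropakāraḥ", "para", "upakāraḥ"}
    dataset = [
        ("paropakāraḥ", ["para", "upakāraḥ"]),  # valid entry
        ("paropakāraḥ", ["para", "upakaraḥ"]),  # typing error in the split
    ]
    print(prune(dataset, dictionary))
```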

Tools: There exist multiple Sandhi splitters in the open domain, such as (i) the JNU splitter Sachin (2007), (ii) the UoH splitter Kumar et al. (2010) and (iii) the INRIA Sanskrit reader companion Huet (2003), Goyal and Huet (2013). Though each tool addresses the splitting problem in a specialized way, the general principle remains constant. For a given compound word, the set of all rules is applied to every character in the word and a large candidate list of word splits is obtained. Then, a morpheme dictionary of Sanskrit words is used with other heuristics to remove infeasible word split combinations. However, none of the approaches addresses the fundamental problem of identifying the location of the split before applying the rules, which would significantly reduce the number of rules that can be applied and hence result in more accurate splits.
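The general principle shared by these tools (apply every rule at every character, then filter with a dictionary) can be sketched as below. This is a toy illustration, not the algorithm of any of the cited splitters: the rule table holds only two vowel Sandhi rules (a + u → o and a + a → ā), only single splits are generated, and the lexicon is hypothetical.

```python
import unicodedata

# Reverse Sandhi rules: surface character -> possible (left ending, right beginning).
REVERSE_RULES = {
    "o": [("a", "u")],       # a + u -> o
    "\u0101": [("a", "a")],  # a + a -> ā
}


def candidate_splits(word):
    """Apply every reverse rule at every character position (brute force)."""
    chars = unicodedata.normalize("NFC", word)
    candidates = []
    for i, ch in enumerate(chars):
        for left_end, right_start in REVERSE_RULES.get(ch, []):
            candidates.append((chars[:i] + left_end, right_start + chars[i + 1:]))
    return candidates


def filter_by_lexicon(candidates, lexicon):
    """Discard candidates containing a part that is not a known word."""
    return [c for c in candidates if all(part in lexicon for part in c)]


if __name__ == "__main__":
    candidates = candidate_splits("paropakāraḥ")
    print(candidates)
    print(filter_by_lexicon(candidates, {"para", "upakāraḥ"}))
```

Even with two rules, the candidate list contains a split at the wrong location; with a full rule set the list grows quickly, which is why knowing the split location beforehand narrows the search.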

4 Evaluation and Results

We evaluate the performance of our DD-RNN model by: (i) comparing the split prediction accuracy with other publicly available Sandhi splitting tools, (ii) comparing the split prediction accuracy with other standard RNN architectures such as RNN, B-RNN, and B-RNN-A, and (iii) comparing the location prediction accuracy with the RNNs used for Chinese word segmentation (as they only predict the split locations and do not learn the rules of splitting).

4.1 Comparison with publicly available tools

The tools discussed in Section 3 take a compound word as input and provide a list of all possible splits as output (UoH and INRIA splitters provide weighted lists). Initially, we compared only the top prediction in each list with the true output. This gave a very low precision for the tools as shown in Figure 3. Therefore, we relaxed this constraint and considered an output to be correct if the true split is present in the top ten predictions of the list. This increased the precision of the tools as shown in Figure 4 and Table 1.
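The relaxed criterion above amounts to a top-k accuracy, which can be sketched as follows. The function name and the toy ranked lists are hypothetical and are not outputs of the actual tools.

```python
def top_k_accuracy(ranked_predictions, gold, k):
    """Fraction of inputs whose true split appears among the first k predictions."""
    hits = sum(
        1
        for ranked, truth in zip(ranked_predictions, gold)
        if truth in ranked[:k]
    )
    return hits / len(gold)


if __name__ == "__main__":
    # Toy ranked candidate lists for two compound words.
    ranked = [
        ["para+apa+kāraḥ", "para+upakāraḥ"],  # true split ranked second
        ["pra+utsāhaḥ"],                      # true split ranked first
    ]
    gold = ["para+upakāraḥ", "pra+utsāhaḥ"]
    print(top_k_accuracy(ranked, gold, k=1))
    print(top_k_accuracy(ranked, gold, k=10))
```

With k=1 only the highest-ranked candidate counts, which corresponds to the strict setting in Figure 3; a larger k corresponds to the relaxed setting in Figure 4.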

Figure 3: Top-1 split prediction accuracy comparison of different publicly available tools with DD-RNN
Figure 4: Split prediction accuracy comparison of different publicly available tools (Top-10) with DD-RNN (Top-1)

Even though DD-RNN generates only one output for every input, it clearly out-performs the other publicly available tools by a fair margin.

4.2 Comparison with standard RNN architectures

To compare the performance of DD-RNN with other standard RNN architectures, we trained the following three models to generate the split predictions on our benchmark dataset: (i) uni-directional encoder and decoder without attention (RNN), (ii) bi-directional encoder and decoder without attention (B-RNN), and (iii) bi-directional encoder and decoder with attention (B-RNN-A).

Model            Location prediction accuracy (%)   Split prediction accuracy (%)
JNU (Top 10)     -                                  8.1
UoH (Top 10)     -                                  47.2
INRIA (Top 10)   -                                  59.9
RNN              79.10                              56.6
B-RNN            84.62                              58.6
B-RNN-A          88.53                              69.3
DD-RNN           95.0                               79.5
LSTM-4           70.2                               -
GRNN-5           67.7                               -
Table 1: Location and split prediction accuracy of all the tools and models under comparison

As seen from the middle part of Table 1, the DD-RNN performs much better than the other architectures, with a split prediction accuracy of 79.5%. Note that B-RNN-A is the same as DD-RNN without the location decoder. Nevertheless, DD-RNN is more accurate than B-RNN-A (79.5% versus 69.3% split prediction accuracy) and outperforms it on almost all word lengths (Figure 5). This indicates that the attention mechanism of DD-RNN has learned to better identify the split location(s) due to its pre-training with the location decoder.

Figure 5: Split prediction accuracy comparison of different variations of RNN on words of different lengths

4.3 Comparison with similar works

Reddy et al. (2018) propose a seq2seq model with attention to tackle the Sandhi problem. Their model is similar to B-RNN-A and is outperformed by our proposed DD-RNN by ~6.47%. We also compared our proposed DD-RNN with a uni-directional LSTM with a depth of 4 Chen et al. (2015b) (LSTM-4) and a Gated Recursive Neural Network with a depth of 5 Chen et al. (2015a) (GRNN-5). These models were used to obtain state-of-the-art results for Chinese word segmentation, and their source code is available online. Since these models can only predict the location(s) of the split(s) and cannot generate the split words themselves, we used the location prediction accuracy as the metric. We trained these models on our benchmark dataset, and the results are shown in Table 1. DD-RNN’s location prediction accuracy (95.0%) is well above that of LSTM-4 (70.2%) and GRNN-5 (67.7%). Conversely, we trained the DD-RNN for the Chinese word segmentation task to test the generalizability of the model. Since there are no morphological changes during segmentation in Chinese, the character decoder is redundant and the model collapses to a simple seq2seq model. We used the PKU dataset, which is also used in Chen et al. (2015b) and Chen et al. (2015a), and obtained an accuracy of 64.25%, which is comparable to the results of other standard models.

To summarize, we have used our benchmark dataset to compare the DD-RNN model with existing publicly available Sandhi splitting tools, other RNN architectures and models used for Chinese word segmentation task. Among the existing tools, the INRIA splitter gives the highest split prediction accuracy of 59.9%. Among the standard RNN architectures, B-RNN-A performs the best with a split prediction accuracy of 69.3%. LSTM-4 performs the best among the Chinese word segmentation models with a location prediction accuracy of 70.2%. DD-RNN outperforms all the models both in location and split predictions with 95% and 79.5% accuracies, respectively.

5 Research Impact

This work can be foundational to other Sanskrit-based NLP tasks. Let us consider translation as an example. In Sanskrit, an arbitrary number of words can be joined together to form a compound word. Literary works, especially from the Vedic era, often contain words which are a concatenation of three or more simpler words. The presence of such compound words increases the vocabulary size exponentially and hinders the translation process. However, if all the compound words are split as a pre-processing step before training a translation model, the number of unique words in the vocabulary is reduced, which eases the learning process.

6 Conclusion

In this research, we propose a novel double decoder RNN architecture with attention for Sanskrit Sandhi splitting. A deep bi-directional encoder is used to encode the character sequence of a Sanskrit word. Using this encoded context vector, a location decoder is first used to learn the location(s) of the split(s). Then the character decoder is used to generate the split words. We evaluate the performance of the proposed approach on the benchmark dataset in comparison with other publicly available tools, standard RNN architectures, and prior work which tackles similar problems in other languages. As future work, we intend to tackle the harder Samasa problem, which requires semantic information about a word in addition to the characters’ context.