Language modeling is a fundamental problem in speech and language processing that involves predicting the next word given its context. Recurrent neural network language models (RNNLMs) have become the de facto standard for language modeling. They typically produce a next-word probability distribution over a fixed vocabulary of words. Such an approach has two main limitations. Word embeddings for infrequently occurring words in the training data are poorly estimated. Also, predictions at the word level are largely oblivious to the subword structure in words. Both these limitations are exacerbated for morphologically rich languages, in which words have numerous morphological variants, leading to large vocabularies where a significant fraction of words appear in the long tail of the word distribution. Leveraging subword information becomes especially important for such languages.
In prior work, RNNLMs have typically exploited subword-level information at the input side, learning improved word embeddings by utilizing morpheme- and character-level information; an exhaustive comparison of many such methods appears in prior work. Incorporating subword information within the output layer of RNNLMs has received less attention. We explore this direction and make the following specific contributions:
We present a new stem-based neural LM that predicts a mixture of stem probabilities and a mixture of word probabilities and meaningfully combines them. We also outline an unsupervised algorithm to identify stems of words.
We also present stem-based models that use multi-task learning and consistently outperform their word-based counterparts.
We demonstrate the effectiveness of our proposed architecture by showing significant reductions in perplexities on four morphologically rich languages: Hindi, Tamil, Kannada and Finnish.
We provide a detailed analysis of the benefits of our stem-driven approach and also contrast our model with a control task that highlights the importance of stems.
2 Model Description
2.1 RNNLMs and mixture models
For a given word $w_t$, an RNNLM encodes its context $w_1, \ldots, w_{t-1}$ into a fixed-size representation, $h_t$. An RNNLM predicts the next-word distribution by applying a softmax function to an affine transformation of $h_t$ ($W$ and $b$ are the model parameters):

$$P(w_t \mid w_1, \ldots, w_{t-1}) = \mathrm{softmax}(W h_t + b) \qquad (1)$$
In a departure from this standard formulation, we could also use a mixture of different softmax distributions at the output layer. Compared to using a single softmax distribution, mixture model-based LMs lead to improved generalization abilities and translate into substantial reductions in test perplexities [16, 15, 25]. If $h_{t,k}$ denotes the representation of the $k$-th of $K$ component models at time $t$ and $\pi_{t,k}$ its context-dependent mixture weight (with $\sum_k \pi_{t,k} = 1$), the next-word distribution in a mixture model becomes:

$$P(w_t \mid w_1, \ldots, w_{t-1}) = \sum_{k=1}^{K} \pi_{t,k} \, \mathrm{softmax}(W_k h_{t,k} + b_k) \qquad (2)$$
RNNLMs are trained to minimize a cross-entropy loss function computed over all training tokens (indexed by $t$):

$$\mathcal{L}_{\text{word}} = -\sum_{t} \log P(w_t \mid w_1, \ldots, w_{t-1})$$

where $P(w_t \mid w_1, \ldots, w_{t-1})$ can be estimated using a single softmax layer or a mixture of softmax layers, as defined in Eqn 1 or Eqn 2, respectively.
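The single-softmax and mixture-of-softmaxes output layers described above can be sketched in plain Python as follows. This is an illustrative sketch, not the paper's PyTorch implementation; all function names are ours.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mixture_next_word_dist(component_logits, mixture_weights):
    """Eqn 2-style output: a weighted sum of K softmax distributions.

    component_logits: K lists of per-word logits (one list per component)
    mixture_weights:  K non-negative weights that sum to 1
    """
    vocab_size = len(component_logits[0])
    dist = [0.0] * vocab_size
    for weight, logits in zip(mixture_weights, component_logits):
        for i, p in enumerate(softmax(logits)):
            dist[i] += weight * p
    return dist
```

Since each component distribution sums to one and the mixture weights sum to one, the mixed output is itself a valid probability distribution over the vocabulary.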
In all subsequent sections, we assume access to a stem for each word (which could be the word itself). We obtain this stem information using an unsupervised stemming algorithm that is detailed in Section 2.4.
2.2 Stem-based LMs using multi-task learning
We treat the standard RNNLM defined in Eqn 1 as the primary task of predicting words and augment it with an auxiliary task of predicting stems. Unlike the loss function for the primary task, $\mathcal{L}_{\text{word}}$, which is computed over words, the auxiliary task is trained using a cross-entropy loss $\mathcal{L}_{\text{stem}}$ between the predicted stem $\hat{s}_t$ and the correct stem $s_t$. As is standard in multi-task learning, a linear combination of both these losses is optimized during training:

$$\mathcal{L} = \mathcal{L}_{\text{word}} + \lambda \, \mathcal{L}_{\text{stem}}$$

where $\lambda$ is a hyperparameter that we tune on a validation set. We refer to this model as MTL-S. At test time, we discard the auxiliary task and only use the word probabilities from the primary task.
In MTL-S, the auxiliary task incurs no loss for predicting the correct stems but incurs a loss even if it predicts the correct word. In order to relax this constraint, we optimize the auxiliary task using the stem-level loss only for the first few epochs and use a word-level cross-entropy loss for the remaining epochs. We refer to this model as MTL-S2W.
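A minimal sketch of the combined MTL-S objective for a single token, assuming the per-token word and stem distributions have already been computed (function names are ours, not the paper's):

```python
import math

def cross_entropy(prob_dist, target_index):
    """Negative log-probability of the correct target."""
    return -math.log(prob_dist[target_index])

def mtl_loss(word_dist, word_target, stem_dist, stem_target, lam):
    """MTL-S objective: primary word loss plus lambda times the auxiliary stem loss."""
    primary = cross_entropy(word_dist, word_target)
    auxiliary = cross_entropy(stem_dist, stem_target)
    return primary + lam * auxiliary
```

MTL-S2W would simply swap `stem_dist`/`stem_target` for a second word-level loss after the first few epochs.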
2.3 Mix-WS: Using mixtures of words and stems
We define a new language model, Mix-WS, that uses estimates from two mixture models – one computed over words and another over stems – to compute the final probability for a word given its context. We first train a mixture model over words as defined in Eqn 2; let us call this distribution $P_{\text{word}}$. We also train a separate mixture model over stems, $P_{\text{stem}}$. In this model, the output distribution is defined only over valid stems while the hidden state encodes a context of past words; we estimate $P_{\text{stem}}$ using a cross-entropy loss between the predicted stem and the correct stem. (We use the same vocabulary to represent both words and stems.)
Below we list the sequence of steps used to estimate word probabilities from Mix-WS for a test word $w$ appearing in a word context $c$:
Compute the conditional probability $P(w \mid s, c)$ of predicting $w$ given its stem $s$ and the context $c$, by renormalizing $P_{\text{word}}$ over all words whose stem is $s$: $P(w \mid s, c) = P_{\text{word}}(w \mid c) \, / \sum_{w': \text{stem}(w') = s} P_{\text{word}}(w' \mid c)$.
Compute the marginal probability of predicting the stem, $P_{\text{stem}}(s \mid c)$.
The final probability for the test word given its context is computed as $P(w \mid c) = P(w \mid s, c) \cdot P_{\text{stem}}(s \mid c)$.
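The steps above can be sketched as follows. Note that the renormalization used for the conditional in the first step is our reading of the elided definition, not necessarily the authors' exact formulation:

```python
def mix_ws_word_prob(p_word, p_stem, stem_of, w):
    """Combine the word mixture and the stem mixture for a test word w.

    p_word:  dict word -> P_word(word | context)
    p_stem:  dict stem -> P_stem(stem | context)
    stem_of: dict word -> stem (from the unsupervised stemmer)
    """
    s = stem_of[w]
    # Step 1: conditional P(w | s, c), here a renormalization of the word
    # mixture over all words sharing the stem s (an assumption on our part)
    stem_mass = sum(p for v, p in p_word.items() if stem_of[v] == s)
    conditional = p_word[w] / stem_mass
    # Steps 2 and 3: multiply by the marginal stem probability
    return conditional * p_stem[s]
```

Under this reading, summing the final probabilities over all words whose stem marginals are consistent recovers a valid distribution.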
Figure 1 illustrates our proposed models. Both MTL-S and MTL-S2W use fixed values of $\lambda$ to scale the loss terms specific to each task, while Mix-WS uses learned weights (that change with the word context) to mix probabilities before computing the word- and stem-specific losses. In the next section, we outline an unsupervised stem identification algorithm that derives stem-like entities from a word vocabulary.
2.4 Unsupervised stem identification algorithm
Algorithm 1 describes the pseudocode of our unsupervised segmentation method. For each prefix pair (and similarly for each suffix pair), we enforce a canonical ordering: a pair $(x, y)$ is kept only if $|x| < |y|$, or $|x| = |y|$ and $x$ is lexicographically smaller than $y$. Frequently occurring pairs, selected using two threshold parameters, are chosen to form the set of prefix/suffix rules governing segmentation. Finally, for each word, the most frequent stem is returned as the output.
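A heavily simplified sketch of the pair-counting idea (a stand-in for Algorithm 1, not the algorithm itself): treat the longest common prefix of two words as a candidate stem, record the leftover endings as a suffix pair, and keep the frequent pairs as rules. Thresholds and names here are illustrative.

```python
from collections import Counter
from itertools import combinations

def suffix_rules(vocab, min_count=2, max_suffix_len=4):
    """Collect frequent suffix pairs from word pairs that share a common prefix.

    The longest common prefix of two words acts as a candidate stem, and the
    leftover endings form a suffix pair. Pairs seen at least min_count times
    become segmentation rules.
    """
    pair_counts = Counter()
    for w1, w2 in combinations(sorted(vocab), 2):
        i = 0
        while i < min(len(w1), len(w2)) and w1[i] == w2[i]:
            i += 1
        s1, s2 = w1[i:], w2[i:]
        if i > 0 and len(s1) <= max_suffix_len and len(s2) <= max_suffix_len:
            pair_counts[(s1, s2)] += 1
    return {pair for pair, c in pair_counts.items() if c >= min_count}
```

On a toy vocabulary like {walk, walks, walked, talk, talks, talked}, this recovers pairs such as ("", "s") and ("ed", "s"), which is the kind of evidence the real algorithm aggregates before choosing each word's most frequent stem.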
3 Experimental Setup
| | Hi | Kn | Ta | Fi |
|---|---|---|---|---|
| # of training tokens | 666K | 434K | 507K | 585K |
| Type/token ratio (train) | 0.08 | 0.22 | 0.21 | 0.197 |
| # of dev tokens | 50K | 24K | 39K | 43K |
| # of test tokens | 49K | 29K | 39K | 44K |
| OoV rate (test) | 5.3% | 4.9% | 15.2% | 5.6% |

Table 1: Dataset statistics for the four languages.
3.1 Dataset description
In this work, we use datasets for four morphologically rich languages, Finnish (Fi), Hindi (Hi), Kannada (Kn) and Tamil (Ta), along with their specified training/dev/test splits. Table 1 shows statistics for these four languages. Kn, Ta and Fi are more morphologically complex than Hi, which is apparent from their higher type-token ratios in Table 1.
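The type/token ratio reported in Table 1 is simply the number of distinct word types divided by the total number of running tokens; a quick sketch:

```python
def type_token_ratio(tokens):
    """Distinct word types divided by total tokens; higher values indicate
    richer morphology (more distinct surface forms per running word)."""
    return len(set(tokens)) / len(tokens)
```

A morphologically rich language inflates the numerator, since each lemma surfaces in many distinct forms.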
3.2 Implementation details
PyTorch was used to implement all models. We report two baseline numbers: (A) Char-CNN-LSTM: an RNNLM that uses character-level inputs for a variety of languages, and (B) LMMRL: an RNNLM that improves over Char-CNN-LSTM by finetuning the output embeddings to capture subword-level information. We report numbers for (A) and (B) using our re-implementations of these baseline systems, which are better than the previously reported numbers (except for Hi). (We also investigated BPE-based neural LMs as baselines, but these produced significantly worse test perplexities than LMMRL for all languages.) Since our datasets are all relatively small, we present test perplexities averaged over five random seeds for all models.
We used SGD for the baseline models and ran each model for 30 epochs. For all our proposed models, we used the Adam optimizer with the learning rate set to 5e-5 (decayed by 0.8) and ran each model for 15 epochs. The best value of $\lambda$ for our MTL models was found by tuning on the development set for a single random seed. MTL-S2W was trained for 5 epochs to optimize the stem-level loss and for the remaining 10 epochs to optimize the word-level loss. All hyperparameters (batch size, sequence length, embedding size, LSTM parameters) were kept constant across all languages and models.
| Model | Hi | Kn | Ta | Fi |
|---|---|---|---|---|
| Char-CNN-LSTM | 383.03 ± 6.96 | 1403.19 ± 58.05 | 2321.00 ± 180.64 | 1998.00 ± 90.25 |
| LMMRL | 375.51 ± 10.88 | 1404.22 ± 129.77 | 2241.00 ± 240.80 | 2017.63 ± 88.21 |
| MTL-W | 305.43 ± 17.51 | 971.40 ± 65.45 | 1567.03 ± 193.36 | 1322.81 ± 46.48 |
| MTL-S | 311.08 ± 21.18 | 946.58 ± 57.69 | 1529.65 ± 168.07 | 1411.16 ± 88.94 |
| MTL-S2W | 304.24 ± 20.91 | 918.45 ± 48.39 | 1489.92 ± 155.60 | 1374.81 ± 68.56 |
| Mix-W | 284.91 ± 18.76 | 795.93 ± 100.74 | 1178.22 ± 322.61 | 1112.46 ± 282.13 |
| Mix-WS | 265.27 ± 23.68 | 764.76 ± 119.03 | 1160.43 ± 224.41 | 1133.48 ± 130.89 |
Table 2: Average perplexities (with standard deviations) on the Hi, Kn, Ta and Fi test sets.
4 Results and Analysis
4.1 Test perplexities of the proposed models
Table 2 summarizes test perplexities averaged over five random seeds on Hi, Kn, Ta and Fi. We present two new models, MTL-W and Mix-W, to serve as fair comparisons to MTL-S, MTL-S2W and Mix-WS. In MTL-W, both the primary and auxiliary tasks predict words and only the primary task is used at test time. Mix-W estimates a mixture model computed over words as defined in Eqn 2.
We observe that all our proposed models significantly outperform the two baselines, Char-CNN-LSTM and LMMRL, on all four languages. (On Kn and Ta, test perplexities are essentially halved.) MTL-W is almost always worse than MTL-S, and MTL-S2W produces consistently lower perplexities than MTL-S. This suggests that incorporating stem information in the auxiliary task as a pretraining step (in MTL-S2W, the auxiliary task is trained first with a stem-based loss and then with a word-based loss) is beneficial to the primary word prediction task. The mixture models, Mix-WS and Mix-W, have significantly lower perplexities compared to all other models. Mix-WS also consistently outperforms Mix-W, except for Fi, where Mix-W performs slightly better on average but with significantly larger variance than any other model (Fig 2 shows the perplexity trends across models for all four languages). This further validates the importance of using a mixture model over stems at test time. (Since Mix-WS has roughly twice the number of parameters of Mix-W, we also trained a Mix-W model for Kn that was comparable in size; it performed worse than the Mix-W model listed in Table 2, with a perplexity of 830.938.)
4.2 Control task to assess importance of stems
We set up a control task to assess the utility of the stems identified by our unsupervised algorithm. First, we started with the list of stems for a word vocabulary from our algorithm. Next, we randomly assigned words to each of these stems, taking care to assign the same number of words to a stem as in our algorithm so as not to alter its distribution. With this randomized word-to-stem assignment in place for Kn, we ran our best model, Mix-WS. We obtained test perplexities of 1161.91 using the randomized stem assignment and 764.76 using the stem assignments from our algorithm. This clearly shows that our derived stems are useful abstractions of the underlying words.
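The randomized assignment can be sketched as a shuffle that preserves each stem's word count, so only the word-to-stem pairing is destroyed (a hypothetical helper, not the paper's code):

```python
import random

def randomize_stem_assignment(word_to_stem, seed=0):
    """Shuffle which word maps to which stem while keeping each stem's
    word count unchanged (so the stem distribution is not altered)."""
    words = list(word_to_stem)
    stems = [word_to_stem[w] for w in words]  # multiset of stems, counts kept
    rng = random.Random(seed)
    rng.shuffle(words)
    return dict(zip(words, stems))
```

Because the multiset of stems is untouched, any perplexity gap against the real assignment isolates the linguistic content of the stems rather than their frequency profile.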
4.3 Supervised vs. Unsupervised stems
We hypothesise that our proposed models will perform even better if the quality of the stems is further improved. To empirically validate this claim, we use a supervised segmentor for Finnish to produce segments for each word, which are then merged into a stem and a suffix using a frequency-based criterion. (The split is chosen by maximising the sum of the resulting stem and suffix frequencies, inversely weighted by their global averages. We fix the number of suffixes to 75, the same as for our algorithm, to ensure a fair comparison.) With these stems in place, we obtain an averaged test perplexity of 1095.60 using the Mix-WS model, compared to 1133.48 when using our unsupervised algorithm to generate stems.
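One plausible reading of the frequency-based merge criterion described above, sketched with hypothetical frequency tables (the scoring details are our assumption):

```python
def best_split(segments, stem_freq, suffix_freq, stem_avg, suffix_avg):
    """Merge a word's segments into (stem, suffix) at the point that maximizes
    the sum of stem and suffix frequencies, each divided by its global average."""
    best, best_score = None, float("-inf")
    for i in range(1, len(segments)):
        stem = "".join(segments[:i])
        suffix = "".join(segments[i:])
        score = (stem_freq.get(stem, 0) / stem_avg
                 + suffix_freq.get(suffix, 0) / suffix_avg)
        if score > best_score:
            best, best_score = (stem, suffix), score
    return best
```

The inverse weighting by global averages keeps one very frequent side (say, a common suffix) from dominating every split decision.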
4.4 Token-level perplexities for frequent and diverse stems
We compared the performance of the Mix-W and Mix-WS models on stems with sufficient coverage and diverse word forms. We isolated stems with 10 or more distinct word types mapping to them, whose word tokens collectively appeared 500 or more times in the training data. We computed averaged test perplexities for only these tokens; Fig 3 shows these values. We see consistent improvements on these specific tokens, and the gap between Mix-WS and Mix-W is much larger for these tokens on Kn and Ta.
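The stem-selection criterion above can be sketched as follows; the `stem_of` mapping is assumed to come from the stemming algorithm, and the thresholds are parameters:

```python
from collections import Counter

def frequent_diverse_stems(tokens, stem_of, min_types=10, min_tokens=500):
    """Stems with at least min_types distinct word forms whose tokens
    jointly occur at least min_tokens times in the data."""
    types_per_stem = {}
    tokens_per_stem = Counter()
    for w in tokens:
        types_per_stem.setdefault(stem_of[w], set()).add(w)
        tokens_per_stem[stem_of[w]] += 1
    return {s for s, types in types_per_stem.items()
            if len(types) >= min_types and tokens_per_stem[s] >= min_tokens}
```

Test perplexity would then be averaged over only the tokens whose stems pass this filter.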
5 Related Work
Prior work has looked at different ways in which morpheme- or character-level information can be provided as input to RNNLMs [6, 10, 13, 12]. Approaches tailored specifically for morphologically rich languages include using constituent morpheme embeddings, morphological recursive NNs, concatenations of word and character embeddings, and other factored representations of words [23, 1, 11, 3]. Factored RNNLMs that integrate multiple word features (POS tags, etc.) have also been explored in prior work [24, 9]. Fewer approaches have focused on injecting subword-level information into the output layer of neural LMs. One line of work proposed a finetuning technique for word embeddings using a loss based on character-level similarities; others split words into subwords and trained LMs using subwords as tokens, or used a mixture model to predict at the word, morpheme and character level.
In this work, we present stem-driven LMs for different morphologically rich languages and demonstrate their efficacy compared to competitive baseline models. We derive stems using a simple unsupervised technique and show how our models' performance can be further improved with better-quality stems. In future work, we will examine the effect of different segmentation algorithms on LM performance.
-  Compositional representation of morphologically-rich input for neural machine translation. In ACL. Cited by: §5.
-  (2014) Compositional morphology for word representations and language modelling. In ICML, Cited by: §5.
-  (2016) A joint model for word embedding and word morphology. In Proceedings of the 1st Workshop on Representation Learning for NLP, Berlin, Germany, pp. 18–26. Cited by: §5.
-  (2016) Morphological segmentation inside-out. In EMNLP, Cited by: §2.4.
-  (2002) Unsupervised discovery of morphemes. CoRR cs.CL/0205057. Cited by: §2.4.
-  (2014) Learning character-level representations for part-of-speech tagging. In ICML, Cited by: §5.
-  (2018) Language modeling for morphologically rich languages: character-aware modeling for word-level prediction. Transactions of the Association for Computational Linguistics 6, pp. 451–465. Cited by: §3.1, §3.2, §3.2, §5.
-  (2001) Unsupervised learning of the morphology of a natural language. Computational Linguistics 27, pp. 153–198. Cited by: §2.4.
-  (2016) Incorporating side information into recurrent neural network language models. In Proceedings of NAACL, Cited by: §5.
-  (2016) Character-aware neural language models. In AAAI, Cited by: §3.2, §5.
-  (2017) Character and subword-based word representation for neural language modeling prediction. In SWCN@EMNLP, Cited by: §5.
-  Neural architectures for named entity recognition. In HLT-NAACL. Cited by: §5.
-  (2015) Character-based neural machine translation. CoRR abs/1511.04586. Cited by: §5.
-  (2013) Better word representations with recursive neural networks for morphology. In CoNLL, Cited by: §5.
-  (2018) Using morphological knowledge in open-vocabulary neural language models. In NAACL-HLT, Cited by: §2.1, §5.
-  (2016) Generalizing and hybridizing count-based and neural language models. In EMNLP, Cited by: §2.1.
-  (2017) Automatic differentiation in PyTorch. In NIPS-W. Cited by: §3.2.
-  (2017) Open morphology of Finnish. Note: LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics, Charles University. Cited by: §4.3.
-  (2007) A segmentation approach to morpheme analysis. Cited by: §2.4.
-  (2017) From characters to words to in between: do we capture morphology?. In ACL, Cited by: §1.
-  (2016) Unsupervised morph segmentation and statistical language models for vocabulary expansion. In ACL, Cited by: §5.
-  (2017) Character-word LSTM language models. In EACL. Cited by: §5.
-  (2017) Word representation models for morphologically rich languages in neural machine translation. In SWCN@EMNLP, Cited by: §5.
-  (2012) Factored language model based on recurrent neural network. In Proceedings of COLING, Cited by: §5.
-  (2018) Breaking the softmax bottleneck: a high-rank RNN language model. CoRR abs/1711.03953. Cited by: §2.1.
-  (2009) Modeling morphologically rich languages using split words and unstructured dependencies. In ACL/IJCNLP, Cited by: §5.