Character language models (CLMs) are distributions over sequences of characters Sutskever et al. (2011), in contrast to traditional language models, which are distributions over sequences of words. CLMs eliminate the need for a fixed word vocabulary, and modeling text at the character level gives the CLM access to subword information. These attributes suggest that CLMs can model regularities that exist within words, such as morphological inflection. However, even large language modeling (LM) corpora have sparse coverage of inflected forms in morphologically-rich languages, which has been shown to make word and character language modeling more difficult Gerz et al. (2018b); Cotterell et al. (2018). Because of this, we hypothesize that accurately modeling morphology improves language modeling, but that it is difficult for CLMs to learn this from text alone.
Motivated by this hypothesis, we add morphology supervision to character language modeling and show that, across two benchmark datasets, multitasking morphology with CLMs improves bits-per-character (BPC) performance on twenty-four languages, even when the annotated morphology features and language modeling data do not overlap. We also show that models augmented by multitasking achieve better BPC improvements on inflected forms than on uninflected forms, and that increasing the amount of language modeling data does not diminish the gains from morphology. Furthermore, to augment morphology annotations in low-resource languages, we also transfer morphology information between pairs of high- and low-resource languages. In this cross-lingual setting, we see that morphology supervision from the high-resource language improves BPC performance on the low-resource language over both the low-resource multitask model and adding language modeling data from the high-resource language alone.
Given a sequence of characters $\mathbf{c} = (c_1, c_2, \ldots, c_n)$, our character-level language models calculate the probability of $\mathbf{c}$ as

$p(\mathbf{c}) = \prod_{t=1}^{n} p(c_t \mid c_1, \ldots, c_{t-1})$
Each distribution $p(c_t \mid c_{<t})$ is an LSTM Hochreiter and Schmidhuber (1997) trained such that at each time step $t$, the model takes in a character $c_t$ and estimates the probability of the next character $c_{t+1}$ as

$h_t = \mathrm{LSTM}(x_t, h_{t-1}), \qquad p(c_{t+1} \mid c_{\leq t}) = \mathrm{softmax}(W h_t + b)$

where $h_{t-1}$ is the previous hidden state of the LSTM, $x_t$ is the character embedding learned by the model for $c_t$, and the softmax is taken over the character vocabulary.
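As a minimal sketch of this softmax step (pure Python, with our own function and variable names; in the actual models the scores $W h_t + b$ are LSTM outputs over the full character vocabulary):

```python
import math

def next_char_distribution(logits, vocab):
    """Numerically stable softmax over the character vocabulary:
    turns output scores (W h_t + b) into p(c_{t+1} | c_{<=t}).
    Sketch only; names are ours, not the paper's code."""
    m = max(logits)                       # subtract max for stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return {ch: e / z for ch, e in zip(vocab, exps)}

probs = next_char_distribution([2.0, 1.0, 0.1], ['a', 'b', 'c'])
assert abs(sum(probs.values()) - 1.0) < 1e-9  # a valid distribution
```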
We calculate the loss function of our language model $\mathcal{L}_{lm}$ as the negative log-likelihood of the model on the character sequence $\mathbf{c}$:

$\mathcal{L}_{lm}(\mathbf{c}) = -\sum_{t=1}^{n} \log p(c_t \mid c_{<t})$
We then evaluate the trained model’s performance with bits-per-character (BPC):

$\mathrm{BPC}(\mathbf{c}) = -\frac{1}{n} \sum_{t=1}^{n} \log_2 p(c_t \mid c_{<t})$
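The BPC metric follows directly from this definition; the function below is our own illustration of the computation, not the paper's code:

```python
import math

def bits_per_character(probs):
    """BPC = -(1/n) * sum_t log2 p(c_t | c_<t), computed from the
    per-character probabilities a language model assigns to a sequence."""
    assert probs and all(0.0 < p <= 1.0 for p in probs)
    return -sum(math.log2(p) for p in probs) / len(probs)

# A model that assigns probability 0.5 to every character costs
# exactly one bit per character.
print(bits_per_character([0.5, 0.5, 0.5, 0.5]))  # → 1.0
```

Lower BPC means the model compresses the text better, which is why a BPC reduction from morphology supervision indicates a stronger model.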
To add morphology features as supervision, we use a multitask learning (MTL) objective Collobert and Weston (2008) that combines loss functions for predicting different morphological tags with the language modeling objective. Since morphological features are annotated at the word-level, we convert these annotations to the character level by placing each annotated word’s tags as supervision on the first character (which we found to outperform supervising the last character in preliminary results).
This early placement allows the model to have access to the morphological features while decoding the rest of the characters in the word. Therefore, our morphology data is a sequence of labeled pairs in the form where is a character and is a set of morphology tags for that character. For example, “cats ran” would be given to our model as the sequence (‘c’, Number=Pl), (‘a’, -), (‘t’, -), (‘s’, -), (‘ ’, -), (‘r’, Tense=Past), (‘a’, -), (‘n’, -).
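This first-character alignment scheme can be sketched as follows (the function name and the simple whitespace tokenization are our own simplifications):

```python
def tags_to_characters(words, tags):
    """Align word-level morphology tags to the character level: each
    word's tag set supervises its FIRST character; all remaining
    characters (and separating spaces) get the empty placeholder '-'."""
    pairs = []
    for i, (word, tag) in enumerate(zip(words, tags)):
        for j, ch in enumerate(word):
            pairs.append((ch, tag if j == 0 else '-'))
        if i < len(words) - 1:   # the space between words is untagged
            pairs.append((' ', '-'))
    return pairs

# Reproduces the "cats ran" example from the text:
print(tags_to_characters(['cats', 'ran'], ['Number=Pl', 'Tense=Past']))
```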
We modify the model’s loss function to

$\mathcal{L} = \mathcal{L}_{lm} + \lambda \sum_{i=1}^{M} \mathcal{L}_{m_i}$

where $M$ is the number of morphological features annotated in a language, $\lambda$ is a weighting parameter between the primary and auxiliary losses, $\mathcal{L}_{lm}$ is the original language modeling loss, and the $\mathcal{L}_{m_i}$ are the additional losses for each morphological feature (e.g., tense, number). Because we include a separate loss for each morphological feature, each feature is predicted independently.
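A numeric sketch of a weighted multitask objective of this form, with names of our own (in training these would be differentiable tensor losses rather than floats):

```python
def multitask_loss(lm_loss, morph_losses, lam):
    """Combine the LM loss with M independent per-feature morphology
    losses: L = L_lm + lam * sum_i L_m_i.  Each entry of morph_losses
    is the loss of one feature-specific classifier (tense, number, ...)."""
    return lm_loss + lam * sum(morph_losses)

# e.g. an LM loss of 2.0, two morphology-feature losses, weight 0.5:
print(multitask_loss(2.0, [0.4, 0.6], 0.5))  # → 2.5
```

Because each feature contributes its own term, a feature-specific classifier can be dropped or reweighted without affecting the others.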
3 Experimental Setup
We obtain morphological annotations for 24 languages (Table 2) from Universal Dependencies (UD; v.2.3), which consists of dependency parsing treebanks with morphology annotations on a large number of languages Nivre et al. (2018). These languages were chosen based on the size of their treebanks (to ensure a sufficient amount of morphology annotations); we also exclude languages that do not have morphology features annotated in the treebank.
For language modeling supervision, we train two sets of models. One set is trained with the text from the UD treebanks; the other set of models is trained on the Multilingual Wikipedia Corpus (MWC) Kawakami et al. (2017). This language modeling dataset consists of Wikipedia data across seven languages (Czech, German, English, Spanish, Finnish, French, and Russian).
Our models each consist of a stacked LSTM with 1024 hidden dimensions and a character embedding layer of 512 dimensions. We include two hidden layers in the language models trained on UD, and three hidden layers in those trained on MWC. The parameters that integrate multitasking into the model (the layer at which we multitask morphology and the weight $\lambda$ given to the morphology losses) are tuned individually for each language. Further hyperparameter and training details are given in the supplement.
4 Language Modeling Results
We first train CLMs where the language modeling data (from MWC) and morphology data (from UD) do not overlap (Table 1). (Since both of these datasets draw from Wikipedia, we verified that no sentences overlap between the MWC test set and the UD treebanks for each of the seven languages.) In this setting, we only train on the morphology features from UD and do not include this data as additional language modeling supervision. These models are trained on alternating batches from the two disjoint datasets. LM is a language modeling baseline with no multitask objective; MTL adds morphology supervision.
We find that for all seven languages, the MTL model outperforms our baseline trained only on MWC. Our model also outperforms the strongest model from Kawakami et al. (2017), HCLMcache, which is a hierarchical language model with caching. Thus, adding morphology supervision to our character language models allows us to achieve lower BPCs than a more complicated LM architecture. Surprisingly, we see a larger gain on languages with more LM data (EN, DE, ES, FR) than on those with less data but which are considered more morphologically rich (e.g., CS and RU); we explore this phenomenon more in Section 5.
Fully Supervised MTL
We then train CLMs using UD for both language modeling and morphology supervision on more languages (Table 2). We again find that adding morphology supervision improves BPC. In general, we see smaller improvements between the LM and MTL models than under distant supervision, even though the UD LM data is fully annotated with morphology tags; this is likely due to the smaller training sets in UD (on average) than in MWC. On languages where the sizes of the two datasets are comparable, such as Russian and Czech, we see larger improvements in the fully supervised models than we do in the distant LM setting.
To investigate these results, we compare the rate of inflected words on the development set (which we use as a rough measure of the morphological complexity of a language) against the BPC improvement of the MTL model (Fig. 1). The rate at which each language is inflected is given in Table 2. We unexpectedly find that how much a language benefits from morphology supervision is only weakly correlated with its inflection rate (r=0.15). This is surprising, because one would expect additional morphological supervision to help languages that encode more morphological features in their surface forms (i.e., those with higher inflection rates).
We then examine the effect of training dataset size on BPC improvement between the LM and the multitasked model (Fig. 1). We find that more training data (which adds both morphological and LM supervision) is strongly correlated with larger gains over the baseline LM (r=0.93). Therefore, it seems that any potential correlation between morphological complexity and the benefit of multitasking morphology is overwhelmed by differences in dataset size.
5 Analysis Experiments
Modeling Inflected Words
We hypothesized that morphology supervision would be most beneficial for words whose surface form depends on their morphology, i.e., inflected words. To investigate this, we calculate the BPC of our UD models on inflected and uninflected forms in the UD development set. We determine whether or not a word is inflected by comparing it to the (annotated) lemma given in the UD treebank. We find that on 16 of the 24 languages for which we train models on UD, the MTL model improves more on inflected words than on uninflected words, and that the average delta between the LM and MTL models is 31% greater for inflected words than for uninflected ones. A comparison of the improvements in six of these languages is given in Fig. 1. We show results for the agglutinative (ET, FI) and introflexive (AR, HE) languages and pick two fusional languages (EN, RU) against which to compare.
Effect of Training Data
One caveat to the observed gain from morphology is that the CLMs may capture this information if given more language modeling data, which is much cheaper to obtain than morphology annotations. To test this, we train CLMs on Czech (CS) and Russian (RU) on varied amounts of language modeling data from the MWC corpus (Table 3). We find that for both RU and CS, increasing the amount of LM data does not eliminate the gains we see from multitasking with morphology. Instead, we see that increasing LM data leads to larger improvements in the MTL model. Even when we train the CLMs on twice as much LM data (obtained from a larger version of the MWC dataset, MWC-large), we continue to see large improvements via multitasking.
We then investigate how the amount of annotated morphology data affects performance of these models (Table 3). We find that, as expected, increasing the amount of morphological data the language model is trained on improves BPC performance. For both Czech and Russian, the MTL models multitasked with 25% or more of the annotated data still outperform the LM baseline, but MTL models trained on smaller subsets of the morphology data perform worse than the baseline. This is in line with our findings in Section 4 that the amount of annotated morphology data is closely tied to how much multitasking helps.
In the previous section, we showed that the amount of training data (both for LM and for morphology) the CLM sees is crucial for better performance. Motivated by this, we extend our models to the cross-lingual setting, in which we use data from high-resource languages to improve performance on closely related, low-resource ones. We train models on the (high, low) language pairs of (Russian, Ukrainian) and (Czech, Slovak) and transfer both LM and morphological supervision (Table 3). We find that the best performance for each low-resource language is achieved by using both the high-resource LM data and morphology annotations to augment the low-resource data. In Slovak (SK), this gives us a 0.333 BPC improvement over the MTL model trained on SK data alone, and in Ukrainian (UK), we see an improvement of 0.032 over the MTL model trained only on UK.
6 Related Work
Prior work has investigated to what degree neural models capture morphology when trained on language modeling Vania and Lopez (2017) and on machine translation Belinkov et al. (2017); Bisazza and Tump (2018). Other work has looked into how the architecture of language models can be improved for morphologically-rich languages Gerz et al. (2018a). In particular, both Kawakami et al. (2017) and Mielke and Eisner (2019) proposed hybrid open-vocabulary LM architectures to deal with rare words in morphologically-rich languages on MWC. (Results comparing against Mielke and Eisner (2019) are given in the supplement, due to a different character vocabulary from Kawakami et al. (2017).)
Another line of work has investigated the use of morphology to improve models trained on other NLP tasks. These approaches add morphology as an input to the model, either with gold labels on the LM dataset Vania and Lopez (2017) or by labeling the data with a pretrained morphological tagger Botha and Blunsom (2014); Matthews et al. (2018). This approach of adding morphology as input features has also been applied to dependency parsers Vania et al. (2018) and semantic role labeling models Şahin and Steedman (2018). Unlike these approaches, however, our technique does not require the morphology data to overlap with the training data of the primary task, nor does it depend on automatically labeled features. Most similar to our work, Dalvi et al. (2017) find that incorporating morphological supervision into the decoder of an NMT system via multitasking improves performance by up to 0.58 BLEU points over the baseline for English-German, English-Czech, and German-English.
7 Conclusion
We incorporate morphological supervision into character language models via multitask learning and find that this addition improves BPC on 24 languages. Furthermore, we observe this gain even when the morphological annotations and language modeling data are disjoint, providing a simple way to improve language modeling without requiring additional annotation efforts. Our analysis finds that the addition of morphology benefits inflected forms more than uninflected forms and that training our CLMs on additional language modeling data does not diminish these gains in BPC. Finally, we show that these gains can also be projected across closely related languages by sharing morphological annotations. We conclude that this multitasking approach helps the CLMs capture morphology better than the LM objective alone.
This material is based upon work supported by the National Science Foundation Graduate Research Fellowship Program under Grant No. DGE-1762114. We thank Victor Zhong, Sewon Min, and the anonymous reviewers for their helpful comments.
- Belinkov et al. (2017) Yonatan Belinkov, Nadir Durrani, Fahim Dalvi, Hassan Sajjad, and James Glass. 2017. What do neural machine translation models learn about morphology? In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pages 861–872.
- Bisazza and Tump (2018) Arianna Bisazza and Clara Tump. 2018. The lazy encoder: A fine-grained analysis of the role of morphology in neural machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2871–2876.
- Botha and Blunsom (2014) Jan Botha and Phil Blunsom. 2014. Compositional morphology for word representations and language modelling. In International Conference on Machine Learning, pages 1899–1907.
- Collobert and Weston (2008) Ronan Collobert and Jason Weston. 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning, pages 160–167. ACM.
- Cotterell et al. (2018) Ryan Cotterell, Sebastian J. Mielke, Jason Eisner, and Brian Roark. 2018. Are all languages equally hard to language-model? In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 536–541. Association for Computational Linguistics.
- Dalvi et al. (2017) Fahim Dalvi, Nadir Durrani, Hassan Sajjad, Yonatan Belinkov, and Stephan Vogel. 2017. Understanding and improving morphological learning in the neural machine translation decoder. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), volume 1, pages 142–151.
- Gerz et al. (2018a) Daniela Gerz, Ivan Vulić, Edoardo Ponti, Jason Naradowsky, Roi Reichart, and Anna Korhonen. 2018a. Language modeling for morphologically rich languages: Character-aware modeling for word-level prediction. Transactions of the Association for Computational Linguistics, 6:451–465.
- Gerz et al. (2018b) Daniela Gerz, Ivan Vulić, Edoardo Maria Ponti, Roi Reichart, and Anna Korhonen. 2018b. On the relation between linguistic typology and (limitations of) multilingual language modeling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 316–327.
- Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation, 9(8):1735–1780.
- Kawakami et al. (2017) Kazuya Kawakami, Chris Dyer, and Phil Blunsom. 2017. Learning to create and reuse words in open-vocabulary neural language modeling. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics.
- Kingma and Ba (2015) Diederik P. Kingma and Jimmy Lei Ba. 2015. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations.
- Matthews et al. (2018) Austin Matthews, Graham Neubig, and Chris Dyer. 2018. Using morphological knowledge in open-vocabulary neural language models. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), volume 1, pages 1435–1445.
- Mielke and Eisner (2019) Sebastian J. Mielke and Jason Eisner. 2019. Spell once, summon anywhere: A two-level open-vocabulary language model. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence.
- Nivre et al. (2018) Joakim Nivre et al. 2018. Universal dependencies 2.3. LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University.
- Şahin and Steedman (2018) Gözde Gül Şahin and Mark Steedman. 2018. Character-level models versus morphology in semantic role labeling. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics.
- Sutskever et al. (2011) Ilya Sutskever, James Martens, and Geoffrey E Hinton. 2011. Generating text with recurrent neural networks. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 1017–1024.
- Vania et al. (2018) Clara Vania, Andreas Grivas, and Adam Lopez. 2018. What do character-level models learn about morphology? the case of dependency parsing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2573–2583. Association for Computational Linguistics.
- Vania and Lopez (2017) Clara Vania and Adam Lopez. 2017. From characters to words to in between: Do we capture morphology? In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2016–2027. Association for Computational Linguistics.
Appendix A Appendix: Languages and Datasets
The languages we use from Universal Dependencies and details about their treebanks are given in Table 4. Most of the treebanks we used in this paper are manually annotated (and then possibly automatically converted to their current format), except for German, English, and French, which are automatically annotated. For models trained in the fully-supervised MTL setting where UD is used for both LM and morphology supervision, we calculate the character vocabulary for each language by including any character that occurs more than 5 times in the training set of the language’s UD treebank.
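The frequency-thresholded vocabulary construction can be sketched as follows (function name is ours; the threshold of 5 occurrences is from the text):

```python
from collections import Counter

def build_char_vocab(text, min_count=5):
    """Build a character vocabulary keeping only the characters that
    occur MORE than `min_count` times in the training text, as done
    for the fully-supervised UD setting described above."""
    counts = Counter(text)
    return {ch for ch, n in counts.items() if n > min_count}

# 'a' occurs 6 times (> 5) and is kept; 'b' occurs only twice.
vocab = build_char_vocab('aaaaaa' + 'bb')
print(vocab)  # → {'a'}
```

At inference time, characters outside this set would be mapped to an unknown-character symbol.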
Dataset statistics for the Multilingual Wikipedia Corpus (MWC) are given in Table 5. When analyzing the effect of LM training dataset size on Czech and Russian, we also train models on the training portion of a larger version of the MWC corpus, MWC-large, which contains approximately twice as much training data as the standard MWC dataset. Specifically, MWC-large contains 10.2M training characters for Czech and 18.2M for Russian. There is no prior work that we know of that reports BPC on this larger dataset.
For models trained on the disjoint supervision setting, we use the character vocabulary provided for each language in the MWC dataset (see Kawakami et al. (2017) for preprocessing details). In cases where we use two sources of supervision for the model – LM supervision from MWC and morphology supervision from UD – we use the MWC character vocabulary for all inputs, so that BPC results across models are comparable. This only affects a small number of the character types (11 or fewer for each language) in the UD training data.
The character vocabulary provided in the MWC dataset and used for the distant supervision setting differs from the vocabulary calculated by including the characters that occur more than 25 times in the MWC training set. (On English, this preprocessing difference decreases the character vocabulary size from 307 in the provided vocabulary to 167.) Because of this, our distant supervision setting on MWC is not comparable with Mielke and Eisner (2019), which uses the second vocabulary setting. Therefore, we retrain our character LM baselines and multitasked models in this vocabulary setting (Table 6). We find that our LM and MTL models generally obtain slightly better performance in this setting, and we continue to see improvement from multitasking morphology over the character LM baseline.
Appendix B Appendix: Model Parameters and Training
To train all models presented in this paper, we use the Adam optimizer Kingma and Ba (2015) with an initial learning rate of 0.002 and clip the norm of the gradient to 5. We also apply dropout of 0.5 to each layer. We train each model on sequences of 150 characters and use early stopping with a patience of 10. We only use the language modeling performance (BPC) on the development set for early stopping and hyperparameter selection (and do not consider the morphology losses). For the UD language models, we train models with two hidden layers for 150 epochs with a batch size of 10. The models trained on MWC contain three hidden layers and are trained for 250 epochs with a batch size of 32. All of our models are implemented in PyTorch (https://pytorch.org/).
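For reference, the hyperparameters above can be collected into a single configuration sketch (the keys and structure are our own; the values are those stated in the text):

```python
# Training configuration summarized from the paper's appendix.
# Dict layout is our own convention, not the authors' code.
TRAIN_CONFIG = {
    "optimizer": "Adam",
    "learning_rate": 0.002,
    "grad_clip_norm": 5,
    "dropout": 0.5,
    "sequence_length": 150,
    "early_stopping_patience": 10,
    "ud": {"hidden_layers": 2, "epochs": 150, "batch_size": 10},
    "mwc": {"hidden_layers": 3, "epochs": 250, "batch_size": 32},
}

print(TRAIN_CONFIG["ud"]["batch_size"])  # → 10
```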
For each language, we individually tuned the layer at which we multitask the morphology objectives and the weighting ratio $\lambda$ between the primary and auxiliary losses. We consider multitasking the morphology objective at either the first or second hidden layer (as all of our models have at least two hidden layers), and tune $\lambda$ for each language. The parameters chosen for each language and setting (fully supervised or distant MTL) are given in Table 7.
Appendix C Appendix: Additional Results