1 Introduction
The majority of the world’s languages overtly encode syntactic information on the word form itself, a phenomenon termed inflectional morphology Dryer et al. (2005). In English, for example, the verbal lexeme with lemma talk has four forms: talk, talks, talked and talking. Other languages, such as Archi Kibrik (1998), distinguish more than a thousand verbal forms. Despite the cornucopia of unique variants a single lexeme may mutate into, native speakers can flawlessly predict the correct variant that the lexeme’s syntactic context dictates. Thus, in computational linguistics, a natural question is the following: Can we estimate a probability model that can do the same?
Inflection generation has received a flurry of attention of late and, moreover, has been the subject of two shared tasks Cotterell et al. (2016, 2017). Most work, however, has focused on the fully supervised case: a source lemma and the morphosyntactic properties are fed into a model, which is asked to produce the desired inflection. In contrast, our work focuses on the semi-supervised case, where we wish to make use of unannotated raw text, i.e., a sequence of inflected tokens.
Figure 1: The directed graphical model, overlaid with example values of the random variables in the sequence. All the conditionals in the Bayesian network are recurrent neural networks; e.g., each morphological tag depends on all previous tags because we employ a recurrent neural network to model the morphological tag sequence.

Concretely, we develop a generative directed graphical model of inflected forms in context. A contextual inflection model works as follows: Rather than just generating the proper inflection for a single given word form out of context (for example walking as the gerund of walk), our generative model is actually a fully-fledged language model. In other words, it generates sequences of inflected words. The graphical model is displayed in Fig. 1, and examples of words it may generate are overlaid on the graphical model notation. That our model is a language model enables it to exploit both inflected lexicons and unlabeled raw text in a principled semi-supervised way. In order to train using raw-text corpora (which is useful when we have less annotated data), we marginalize out the unobserved lemmata and morphosyntactic annotation from unlabeled data. In terms of Fig. 1, this refers to marginalizing out the lemma and morphological-tag variables. As this marginalization is intractable, we derive a variational inference procedure that allows for efficient approximate inference. Specifically, we modify the wake-sleep procedure of Hinton et al. (1995). It is the inclusion of raw text in this fashion that makes our model token level, a novelty in the camp of inflection generation, as much recent work in inflection generation Dreyer et al. (2008); Durrett and DeNero (2013); Nicolai et al. (2015); Ahlberg et al. (2015); Faruqui et al. (2016) trains a model on type-level lexicons.

We offer empirical validation of our model’s utility with experiments on 23 languages from the Universal Dependencies corpus in a simulated low-resource setting. (Footnote 1: We make our code and data available at: https://github.com/lwolfsonkin/morphsvae.) Our semi-supervised scheme improves inflection generation by over 10% absolute accuracy in some cases.
Table 1: Full paradigms of the German nouns Wort and Herr.

        Wort               Herr
        sg      pl         sg      pl
nom     Wort    Wörter     Herr    Herren
gen     Wortes  Wörter     Herrn   Herren
acc     Wort    Wörter     Herrn   Herren
dat     Worte   Wörtern    Herrn   Herren
2 Background: Morphological Inflection
2.1 Inflectional Morphology
To properly discuss models of inflectional morphology, we require a formalization. We adopt the framework of word-based morphology Aronoff (1976); Spencer (1991). Note that in the present paper, we omit derivational morphology.
We define an inflected lexicon as a set of 4-tuples consisting of a part-of-speech tag, a lexeme, an inflectional slot, and a surface form. A lexeme is a discrete object that indexes the word’s core meaning and part of speech. In place of such an abstract lexeme, lexicographers will often use a lemma, denoted by , which is a designated surface form of the lexeme (such as the infinitive). (Footnote 2: A specific slot of the paradigm is chosen, depending on the part-of-speech tag; all these terms are defined next.) For the remainder of this paper, we will use the lemma as a proxy for the lexeme, wherever convenient, although we note that lemmata may be ambiguous: bank is the lemma for at least two distinct nouns and two distinct verbs. For inflection, this ambiguity will rarely play a role; for instance, all senses of bank inflect in the same fashion. (Footnote 3: One example of a paradigm where the lexeme, rather than the lemma, may influence inflection is hang. If one chooses the lexeme that licenses animate objects, the proper past tense is hanged, whereas it is hung for the lexeme that licenses inanimate objects.)
A part-of-speech (POS) tag, denoted , is a coarse syntactic category such as Verb. Each POS tag allows some set of lexemes, and also allows some set of inflectional slots, denoted as , such as . Each allowed triple is realized, in only one way, as an inflected surface form, a string over a fixed phonological or orthographic alphabet . (In this work, we take to be an orthographic alphabet.) Additionally, we will define the term morphological tag, denoted by , which we take to be the POS-slot pair . We will further define as the set of all POS tags and as the set of all morphological tags.
A paradigm is the mapping from tag ’s slots to the surface forms that “fill” those slots for lexeme/lemma . For example, in the English paradigm , the past-tense slot is said to be filled by talked, meaning that the lexicon contains the tuple .
A cheat sheet for the notation is provided in Tab. 2.
We will specifically work with the UniMorph annotation scheme Sylak-Glassman (2016). Here, each slot specifies a morphosyntactic bundle of inflectional features such as tense, mood, person, number, and gender. For example, the German surface form Wörtern is listed in the lexicon with tag Noun, lemma Wort, and a slot specifying the feature bundle . The full paradigms and are found in Tab. 1.
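The definitions above can be made concrete with a small sketch. The snippet below stores an inflected lexicon as a set of 4-tuples and recovers a paradigm as a mapping from slots to surface forms. The entries are taken from Tab. 1; the slot strings (e.g. `DAT;PL`) and the helper name `paradigm` are illustrative assumptions, not the exact UniMorph encoding or the released code.

```python
from typing import Dict, Set, Tuple

# (POS tag, lemma standing in for the lexeme, inflectional slot, surface form)
Entry = Tuple[str, str, str, str]

# Toy lexicon built from the German paradigms in Tab. 1.
# Slot strings such as "DAT;PL" are an assumed, simplified notation.
LEXICON: Set[Entry] = {
    ("N", "Wort", "NOM;SG", "Wort"), ("N", "Wort", "NOM;PL", "Wörter"),
    ("N", "Wort", "GEN;SG", "Wortes"), ("N", "Wort", "GEN;PL", "Wörter"),
    ("N", "Wort", "ACC;SG", "Wort"), ("N", "Wort", "ACC;PL", "Wörter"),
    ("N", "Wort", "DAT;SG", "Worte"), ("N", "Wort", "DAT;PL", "Wörtern"),
    ("N", "Herr", "NOM;SG", "Herr"), ("N", "Herr", "NOM;PL", "Herren"),
    ("N", "Herr", "GEN;SG", "Herrn"), ("N", "Herr", "GEN;PL", "Herren"),
}


def paradigm(lexicon: Set[Entry], pos: str, lemma: str) -> Dict[str, str]:
    """Map each slot of (pos, lemma) to the surface form that fills it."""
    table: Dict[str, str] = {}
    for p, l, slot, form in lexicon:
        if (p, l) == (pos, lemma):
            # Each allowed triple is realized in only one way.
            assert table.get(slot, form) == form
            table[slot] = form
    return table


wort = paradigm(LEXICON, "N", "Wort")
print(wort["DAT;PL"])  # Wörtern
```

Using the lemma as the dictionary key mirrors the text's choice of the lemma as a proxy for the lexeme; a lexicon that distinguishes homonymous lexemes would need an extra identifier.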
2.2 Morphological Inflection
Now, we formulate the task of context-free morphological inflection using the notation developed in § 2. Given a set of form-tag-lemma triples , the goal of morphological inflection is to map the pair to the form . As the definition above indicates, the task is traditionally performed at the type level. In this work, however, we focus on a generalization of the task to the token level: we seek to map a bisequence of lemma-tag pairs to the sequence of inflected forms in context. Formally, we will denote the lemma-morphological tag bisequence as and the form sequence as . Foreshadowing, the primary motivation for this generalization is to enable the use of raw text in a semi-supervised setting.
3 Generating Sequences of Inflections
The primary contribution of this paper is a novel generative model over sequences of inflected words in their sentential context. Following the notation laid out in § 2.2, we seek to jointly learn a distribution over sequences of forms , lemmata , and morphological tags . The generative procedure is as follows: First, we sample a sequence of tags , each morphological tag coming from a language model over morphological tags: . Next, we sample the sequence of lemmata given the previously sampled sequence of tags; these are sampled conditioned only on the corresponding morphological tag: . Finally, we sample the sequence of inflected words , where, again, each word is chosen conditionally independently of other elements of the sequence: . (Footnote 4: Note that we denote all three distributions as to simplify notation and emphasize the joint modeling aspect; context will always resolve the ambiguity in this paper.) We will discuss their parameterization in § 4.
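Writing $\mathbf{f}$, $\boldsymbol{\ell}$, and $\mathbf{m}$ for the sequences of forms, lemmata, and morphological tags of a sentence of length $n$, this yields the factorized joint distribution:

$$p(\mathbf{f}, \boldsymbol{\ell}, \mathbf{m}) = \prod_{i=1}^{n} p(f_i \mid \ell_i, m_i) \, p(\ell_i \mid m_i) \, p(m_i \mid \mathbf{m}_{<i}) \qquad (1)$$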
We depict the corresponding directed graphical model in Fig. 1.
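The three-step generative procedure can be illustrated with ancestral sampling over toy distributions. In this minimal sketch the tables, the probabilities, and the rule-based `inflect` function are all made-up stand-ins for the neural conditionals. As a further simplification, the toy tag model looks only at the previous tag, whereas the model described here conditions on all previous tags.

```python
import random

# Toy stand-ins for the three conditionals; all numbers are made up.
# Simplification: the tag model below looks only at the previous tag,
# whereas the model in the text conditions on all previous tags.
TAG_LM = {
    "<s>": {"N;SG": 0.6, "N;PL": 0.4},
    "N;SG": {"V;3;SG": 0.7, "</s>": 0.3},
    "N;PL": {"V;PL": 0.7, "</s>": 0.3},
    "V;3;SG": {"</s>": 1.0},
    "V;PL": {"</s>": 1.0},
}
LEMMA_GIVEN_POS = {
    "N": {"dog": 0.5, "cat": 0.5},
    "V": {"talk": 0.5, "walk": 0.5},
}


def inflect(lemma, tag):
    """Deterministic toy inflector standing in for p(form | lemma, tag)."""
    if tag in ("N;PL", "V;3;SG"):
        return lemma + "s"
    return lemma


def draw(dist, rng):
    """Sample a key of `dist` with probability proportional to its value."""
    keys = list(dist)
    return rng.choices(keys, weights=[dist[k] for k in keys], k=1)[0]


def sample_sentence(rng):
    """Ancestral sampling: tags first, then lemmata, then forms."""
    tags, prev = [], "<s>"
    while True:
        tag = draw(TAG_LM[prev], rng)
        if tag == "</s>":
            break
        tags.append(tag)
        prev = tag
    # Each lemma depends only on the POS part of its own tag.
    lemmata = [draw(LEMMA_GIVEN_POS[t.split(";")[0]], rng) for t in tags]
    # Each form depends only on its own lemma and tag.
    forms = [inflect(l, t) for l, t in zip(lemmata, tags)]
    return forms, lemmata, tags


forms, lemmata, tags = sample_sentence(random.Random(0))
print(forms, lemmata, tags)
```

Because sampling follows the direction of the arrows in the graphical model, complete triples of forms, lemmata, and tags are cheap to draw.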
Relation to Other Models in NLP.
As the graphical model drawn in Fig. 1 shows, our model is quite similar to a Hidden Markov Model (HMM) Rabiner (1989). There are two primary differences. First, we remark that an HMM directly emits a form conditioned on the tag . Our model, in contrast, emits a lemma conditioned on the morphological tag and, then, conditioned on both the lemma and the tag , we emit the inflected form . In this sense, our model resembles the hierarchical HMM of Fine et al. (1998) with the difference that we do not have interdependence between the lemmata . The second difference is that our model is non-Markovian: we sample the morphological tag from a distribution that depends on all previous tags, using an LSTM language model (§ 4.1). This yields richer interactions among the tags, which may be necessary for modeling long-distance agreement phenomena.
Why a Generative Model?
What is our interest in a generative model of inflected forms? Eq. 1 is a syntax-only language model in that it only allows for interdependencies between the morphosyntactic tags in . However, given a tag sequence , the individual lemmata and forms are conditionally independent. This prevents the model from learning notions such as semantic frames and topicality. So what is this model good for? Our chief interest is the ability to train a morphological inflector on unlabeled data, which is a boon in a low-resource setting. As the model is generative, we may consider the latent-variable model:
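$$p(\mathbf{f}) = \sum_{\langle \boldsymbol{\ell}, \mathbf{m} \rangle} p(\mathbf{f}, \boldsymbol{\ell}, \mathbf{m}) \qquad (2)$$

where we marginalize out the latent lemmata $\boldsymbol{\ell}$ and morphological tags $\mathbf{m}$ from raw text $\mathbf{f}$. The sum in Eq. 2 is unabashedly intractable: given a sequence of forms, it involves consideration of an exponential (in the sequence length) number of tag sequences and an infinite number of lemmata sequences. Thus, we will fall back on an approximation scheme (see § 5).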
4 Recurrent Neural Parameterization
The graphical model from § 3 specifies a family of models that obey the conditional independence assumptions dictated by the graph in Fig. 1. In this section we define a specific parameterization using long short-term memory (LSTM) recurrent neural network Hochreiter and Schmidhuber (1997) language models Sundermeyer et al. (2012).
4.1 LSTM Language Models
Before proceeding, we review the modeling of sequences with LSTM language models. Given some alphabet , the distribution over sequences can be defined as follows:
(3) 
where . The prediction at time step of a single element is then parametrized by a neural network:
(4) 
where and are learned parameters (for some number of hidden units ) and the hidden state is defined through the recurrence given by Hochreiter and Schmidhuber (1997) from the previous hidden state and an embedding of the previous character (assuming some learned embedding function for some number of dimensions ):
(5) 
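For reference, a standard way to write such an LSTM language model is sketched below. The symbols are assumptions chosen for this sketch rather than notation recovered from the source: $\Sigma$ is the alphabet, $\mathbf{x} = x_1 \cdots x_{|\mathbf{x}|}$ the sequence, $\mathbf{h}_i \in \mathbb{R}^{d}$ the hidden state, $\mathbf{W} \in \mathbb{R}^{|\Sigma| \times d}$ and $\mathbf{b} \in \mathbb{R}^{|\Sigma|}$ the output parameters, and $e(\cdot)$ the learned embedding function.

```latex
\begin{align}
  p(\mathbf{x}) &= \prod_{i=1}^{|\mathbf{x}|} p(x_i \mid \mathbf{x}_{<i}) \\
  p(x_i \mid \mathbf{x}_{<i}) &= \operatorname{softmax}\!\left(\mathbf{W}\,\mathbf{h}_i + \mathbf{b}\right)_{x_i} \\
  \mathbf{h}_i &= \operatorname{LSTM}\!\left(\mathbf{h}_{i-1},\, e(x_{i-1})\right)
\end{align}
```

The three lines correspond, in order, to the chain-rule factorization over the sequence, the softmax prediction at a single time step, and the recurrence that updates the hidden state from the previous state and the embedding of the previous symbol.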
4.2 Our Conditional Distributions
We discuss each of the factors in Eq. 1 in turn.
Morphological Tag Language Model: .
We define as an LSTM language model, as defined in § 4.1, where we take , i.e., the elements of the sequence that are to be predicted are tags like . Note that the embedding function does not treat them as atomic units, but breaks them up into individual attribute-value pairs that are embedded individually and then summed to yield the final vector representation. To be precise, each tag is first encoded by a multi-hot vector, where each component corresponds to an attribute-value pair in the slot, and then this multi-hot vector is multiplied with an embedding matrix.
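The multi-hot construction can be checked with a few lines of plain Python. The inventory of attribute-value pairs, the pair names such as `TENSE=PST`, and the embedding values below are hypothetical; in the model the matrix would be learned. The point of the sketch is that multiplying a multi-hot vector with the embedding matrix equals summing the embeddings of the active pairs.

```python
from typing import List

# Assumed toy inventory of attribute-value pairs and a tiny embedding
# matrix (one row per pair, 3 dimensions); real values would be learned.
PAIRS = ["POS=V", "POS=N", "TENSE=PST", "NUM=SG", "NUM=PL"]
INDEX = {pair: i for i, pair in enumerate(PAIRS)}
EMBEDDING = [
    [0.1, 0.0, 0.2],  # POS=V
    [0.0, 0.3, 0.1],  # POS=N
    [0.5, 0.1, 0.0],  # TENSE=PST
    [0.2, 0.2, 0.2],  # NUM=SG
    [0.0, 0.4, 0.3],  # NUM=PL
]


def multi_hot(tag: List[str]) -> List[int]:
    """Encode a tag as a 0/1 vector with one component per pair."""
    vec = [0] * len(PAIRS)
    for pair in tag:
        vec[INDEX[pair]] = 1
    return vec


def embed_tag(tag: List[str]) -> List[float]:
    """Multiply the multi-hot row vector with the embedding matrix."""
    hot = multi_hot(tag)
    dim = len(EMBEDDING[0])
    return [
        sum(hot[i] * EMBEDDING[i][j] for i in range(len(PAIRS)))
        for j in range(dim)
    ]


# Equals the element-wise sum of the rows for POS=V and TENSE=PST.
vec = embed_tag(["POS=V", "TENSE=PST"])
print(vec)
```

Sharing rows across tags in this way lets rarely seen tags reuse the embeddings of attribute-value pairs observed in other tags.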
Lemma Generator: .
The next distribution in our model is a lemma generator which we define to be a conditional LSTM language model over characters (we take ), i.e., each is a single (orthographic) character. The language model is conditioned on (the part-of-speech information contained in the morphological tag ), which we embed into a low-dimensional space and feed to the LSTM by concatenating its embedding with that of the current character. Thus, we obtain the new recurrence relation for the hidden state:
(6) 
where denotes the character of the generated lemma and for some is a learned embedding function for POS tags. Note that we embed only the POS tag, rather than the entire morphological tag, as we assume the lemma depends on the part of speech exclusively.
Morphological Inflector: .
The final conditional in our model is a morphological inflector, which we parameterize as a neural recurrent sequence-to-sequence model Sutskever et al. (2014) with Luong dot-style attention Luong et al. (2015). Our particular model uses a single encoder-decoder architecture (Kann and Schütze, 2016) for all tag pairs within a language, and we refer the reader to that paper for further details. Concretely, the encoder consists of two LSTMs that run over a string consisting of the desired slot and all characters of the lemma that is to be inflected (e.g. <w> V PST t a l k </w>), one running left-to-right, the other right-to-left. Concatenating the hidden states of both RNNs at each time step results in hidden states . The decoder, again, takes the form of an LSTM language model (we take ), producing the inflected form character by character. At each time step, however, it considers not only the previous hidden state and the previously generated token, but also attention (a convex combination) over all encoder hidden states , with the distribution given by another neural network; see Luong et al. (2015).
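Two ingredients of the inflector are easy to illustrate in isolation: the symbol sequence read by the encoder, and attention as a convex combination of encoder states. The sketch below assumes the plain dot-product scoring variant and uses made-up two-dimensional vectors in place of real LSTM states; the function names are hypothetical, and the surrounding encoder and decoder networks are omitted.

```python
import math
from typing import List, Tuple


def encoder_input(slot_features: List[str], lemma: str) -> List[str]:
    """Build the encoder's symbol sequence: slot features, then lemma characters."""
    return ["<w>"] + slot_features + list(lemma) + ["</w>"]


def dot_attention(
    decoder_state: List[float], encoder_states: List[List[float]]
) -> Tuple[List[float], List[float]]:
    """Dot-product attention: softmax over scores, then a convex combination."""
    scores = [
        sum(d * h for d, h in zip(decoder_state, hs)) for hs in encoder_states
    ]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(encoder_states[0])
    context = [
        sum(w * hs[j] for w, hs in zip(weights, encoder_states))
        for j in range(dim)
    ]
    return weights, context


symbols = encoder_input(["V", "PST"], "talk")
print(" ".join(symbols))  # <w> V PST t a l k </w>

# Made-up vectors standing in for one decoder state and three encoder states.
weights, context = dot_attention([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
print(weights, context)
```

Since the weights are non-negative and sum to one, the context vector always lies in the convex hull of the encoder states, which is what the phrase "convex combination" refers to.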
5 Semi-Supervised Wake-Sleep
We train the model with the wake-sleep procedure, which requires us to perform posterior inference over the latent variables. However, the exact computation in the model is intractable: it involves a sum over all possible lemmatizations and taggings of the sentence, as shown in Eq. 2. Thus, we fall back on a variational approximation Jordan et al. (1999). We train an inference network that approximates the true posterior over the latent variables . (Footnote 5: Inference networks are also known as stochastic inverses Stuhlmüller et al. (2013) or recognition models Dayan et al. (1995).) The variational family we choose in this work will be detailed in § 5.5. We fit the distribution using a semi-supervised extension of the wake-sleep algorithm Hinton et al. (1995); Dayan et al. (1995); Bornschein and Bengio (2014). We derive the algorithm in the following subsections and provide pseudocode in Alg. 1.
Note that the wake-sleep algorithm shows structural similarities to the expectation-maximization (EM) algorithm (Dempster et al., 1977), and, presaging the exposition, we note that the wake-sleep procedure is a type of variational EM Beal (2003). The key difference is that the E-step minimizes an inclusive KL divergence, rather than the exclusive one typically found in variational EM.
5.1 Data Requirements of Wake-Sleep
We emphasize again that we will train our model in a semi-supervised fashion. Thus, we will assume a set of labeled sentences, , represented as a set of triples , and a set of unlabeled sentences, , represented as a set of surface form sequences .
5.2 The Sleep Phase
Wake-sleep first dictates that we find an approximate posterior distribution that minimizes the KL divergences for all form sequences:
(7) 
with respect to the parameters , which control the variational approximation . Because is trained to be a variational approximation for any input , it is called an inference network. In other words, it will return an approximate posterior over the latent variables for any observed sequence. Importantly, note that computation of Eq. 7 is still hard—it requires us to normalize the distribution , which, in turn, involves a sum over all lemmatizations and taggings. However, it does lend itself to an efficient Monte Carlo approximation. As our model is fully generative and directed, we may easily take samples from the complete joint. Specifically, we will take samples by forward sampling and define them as . We remark that we use a tilde to indicate that a form, lemmata or tag is sampled, rather than human annotated. Using samples, we obtain the objective
(8) 
which we could maximize by fitting the model through backpropagation Rumelhart et al. (1986), as one would during maximum likelihood estimation.
5.3 The Wake Phase
Now, given our approximate posterior , we are in a position to re-estimate the parameters of the generative model . Given a set of unannotated sentences , we again first consider the objective
(9) 
where is a set of triples with and , maximizing with respect to the parameters (we may stochastically backprop through the expectation simply by backpropagating through this sum). Note that Eq. 9 is a Monte Carlo approximation of the inclusive divergence of the data distribution of times with .
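In standard wake-sleep notation, the unsupervised objectives of § 5.2 and § 5.3 can be sketched as follows. The symbols are assumptions of this sketch: $p_\theta$ is the generative model, $q_\phi$ the inference network, $\mathcal{U}$ the set of unannotated sentences, $K$ the number of dreamed samples, and a tilde marks sampled quantities. The exact normalization used in the original equations is not recoverable from the text, so constant factors are omitted.

```latex
\begin{align}
  % Sleep phase: inclusive KL from the true posterior to the approximation
  \min_{\phi}\;& D_{\mathrm{KL}}\!\left( p_\theta(\boldsymbol{\ell}, \mathbf{m} \mid \mathbf{f})
      \,\Vert\, q_\phi(\boldsymbol{\ell}, \mathbf{m} \mid \mathbf{f}) \right) \\
  % Monte Carlo estimate using K samples drawn from the joint p_theta
  \mathcal{S}_{\mathrm{unsup}} &= \sum_{k=1}^{K}
      \log q_\phi\!\left( \tilde{\boldsymbol{\ell}}^{(k)}, \tilde{\mathbf{m}}^{(k)}
      \mid \tilde{\mathbf{f}}^{(k)} \right),
      \qquad \langle \tilde{\mathbf{f}}^{(k)}, \tilde{\boldsymbol{\ell}}^{(k)},
      \tilde{\mathbf{m}}^{(k)} \rangle \sim p_\theta \\
  % Wake phase: fit p_theta to raw text completed with samples from q_phi
  \mathcal{W}_{\mathrm{unsup}} &= \sum_{\mathbf{f} \in \mathcal{U}}
      \log p_\theta\!\left( \mathbf{f}, \tilde{\boldsymbol{\ell}}, \tilde{\mathbf{m}} \right),
      \qquad \langle \tilde{\boldsymbol{\ell}}, \tilde{\mathbf{m}} \rangle
      \sim q_\phi(\cdot \mid \mathbf{f})
\end{align}
```

Both estimates are plain sums of log-probabilities of sampled triples, so each phase reduces to ordinary maximum-likelihood training on data completed by the other model.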
5.4 Adding Supervision to Wake-Sleep
So far, we have presented a purely unsupervised training method that makes no assumptions about the latent lemmata and morphological tags. In our case, however, we have a very clear idea what the latent variables should look like. For instance, we are quite certain that the lemma of talking is talk and that it is in fact a gerund. And, indeed, we have access to annotated examples in the form of an annotated corpus. In the presence of these data, we optimize the supervised sleep phase objective,
(10) 
which is a Monte Carlo approximation of . Thus, when fitting our variational approximation , we will optimize a joint objective , where , to repeat, uses actual annotated lemmata and morphological tags; we balance the two parts of the objective with a scaling parameter . Note that on the first sleep phase iteration, we set since taking samples from an untrained when we have available labeled data is of little utility. We will discuss the provenance of our data in § 7.2.
Likewise, in the wake phase we can neglect the approximation in favor of the annotated latent variables found in ; this leads to the following supervised objective
(11) 
which is a Monte Carlo approximation of . As in the sleep phase, we will maximize , where is, again, a scaling parameter.
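To make the alternation concrete, the following toy sketch runs semi-supervised wake-sleep on a deliberately tiny problem: one latent tag per word, tabular distributions instead of neural networks and a CRF, and smoothed weighted counting in place of gradient-based fitting. Every name, the data, and the single weight `gamma` shared by both phases are assumptions made for illustration; this is not the released implementation.

```python
import random
from collections import Counter

TAGS, FORMS = ["SG", "PL"], ["dog", "dogs"]


def normalize(counts, keys, smooth=0.1):
    """Turn (smoothed) weighted counts into a probability table over `keys`."""
    total = sum(counts[k] + smooth for k in keys)
    return {k: (counts[k] + smooth) / total for k in keys}


def fit_p(weighted_pairs):
    """Weighted count-based fit of the generative p(tag) and p(form | tag)."""
    tag_c, form_c = Counter(), {t: Counter() for t in TAGS}
    for (form, tag), w in weighted_pairs:
        tag_c[tag] += w
        form_c[tag][form] += w
    return normalize(tag_c, TAGS), {t: normalize(form_c[t], FORMS) for t in TAGS}


def fit_q(weighted_pairs):
    """Weighted count-based fit of the inference table q(tag | form)."""
    counts = {f: Counter() for f in FORMS}
    for (form, tag), w in weighted_pairs:
        counts[form][tag] += w
    return {f: normalize(counts[f], TAGS) for f in FORMS}


def draw(dist, rng):
    """Sample a key of `dist` with probability proportional to its value."""
    keys = list(dist)
    return rng.choices(keys, weights=[dist[k] for k in keys], k=1)[0]


def wake_sleep(labeled, unlabeled, iters=5, k=50, gamma=0.5, seed=0):
    rng = random.Random(seed)
    sup = [(pair, 1.0) for pair in labeled]  # supervised terms, weight 1
    p_tag, p_form = fit_p([])  # untrained p: uniform via smoothing
    q = fit_q([])
    for it in range(iters):
        # Sleep: dream (form, tag) pairs from p, then fit q on labeled + dreams.
        # Dreams get weight 0 on the first iteration, when p is untrained.
        g = 0.0 if it == 0 else gamma
        dreams = []
        for _ in range(k):
            tag = draw(p_tag, rng)
            dreams.append(((draw(p_form[tag], rng), tag), g))
        q = fit_q(sup + dreams)
        # Wake: tag raw forms with q, then fit p on labeled + tagged raw text.
        guesses = [((form, draw(q[form], rng)), gamma) for form in unlabeled]
        p_tag, p_form = fit_p(sup + guesses)
    return p_tag, p_form, q


labeled = [("dog", "SG"), ("dogs", "PL")]
unlabeled = ["dogs"] * 6 + ["dog"] * 4
p_tag, p_form, q = wake_sleep(labeled, unlabeled)
print(p_tag, p_form, q)
```

For tabular distributions, normalized weighted counts maximize the corresponding weighted log-likelihood, so each `fit_*` call plays the role of a full optimization of the sleep or wake objective. The small smoothing constant is an extra assumption that keeps every probability non-zero.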
5.5 Our Variational Family
How do we choose the variational family ? In terms of NLP nomenclature, represents a joint morphological tagger and lemmatizer. The open-source tool lemming Müller et al. (2015) represents such an object. lemming is a higher-order linear-chain conditional random field (CRF; Lafferty et al., 2001) that is an extension of the morphological tagger of Müller et al. (2013). Interestingly, lemming is a linear model that makes use of simple character n-gram feature templates. On both the tasks of morphological tagging and lemmatization, neural models have supplanted linear models in terms of performance in the high-resource case Heigold et al. (2017). However, we are interested in producing an accurate approximation to the posterior in the presence of minimal annotated examples and potentially noisy samples produced during the sleep phase, where linear models still outperform non-linear approaches Cotterell and Heigold (2017). We note that our variational approximation is compatible with any family.
5.6 Interpretation as an Autoencoder
We may also view our model as an autoencoder, following Kingma and Welling (2013), who saw that a variational approximation to any generative model naturally has this interpretation. The crucial difference between Kingma and Welling (2013) and this work is that our model is a structured variational autoencoder in the sense that the space of our latent code is structured: the inference network encodes a sentence into a pair of lemmata and morphological tags . This bisequence is then decoded back into the sequence of forms through a morphological inflector. The reason the model is called an autoencoder is that we arrive at an autoencoding-like objective if we combine the and as so:
(12)
where is a copy of the original sentence .
Note that this choice of latent space sadly precludes us from making use of the reparametrization trick that makes inference in VAEs particularly efficient. In fact, our whole inference procedure is quite different, as we do not perform gradient descent on both and jointly but optimize the two alternately (using wake-sleep). We nevertheless call our model a VAE to uphold the distinction between the VAE as a model (essentially a specific Helmholtz machine Dayan et al. (1995), justified by variational inference) and the end-to-end inference procedure that is commonly used.
Another way of viewing this model is that it tries to force the words in the corpus through a syntactic bottleneck. Spiritually, our work is close to the conditional random field autoencoder of Ammar et al. (2014).
We remark that many other structured NLP tasks can be “autoencoded” in this way and, thus, trained by a similar wake-sleep procedure. For instance, any two tasks that effectively function as inverses, e.g., translation and back-translation, or language generation and parsing, can be treated with a similar variational autoencoder. While this work only focuses on the creation of an improved morphological inflector , one could imagine a situation where the encoder was also a task of interest. That is, the goal would be to improve both the decoder (the generation model) and the encoder (the variational approximation).
6 Related Work
Closest to our work is Zhou and Neubig (2017), who describe an unstructured variational autoencoder. However, the exact use case of our respective models is distinct. Our method models the syntactic dynamics with an LSTM language model over morphological tags. Thus, in the semi-supervised setting, we require token-level annotation. Additionally, our latent variables are interpretable as they correspond to well-understood linguistic quantities. In contrast, Zhou and Neubig (2017) infer latent lemmata as real vectors. To the best of our knowledge, ours is only the second attempt, after Zhou and Neubig (2017), to perform semi-supervised learning for a neural inflection generator. Other non-neural attempts at semi-supervised learning of morphological inflectors include Hulden et al. (2014). Models in this vein often focus on exploiting corpus statistics, e.g., token frequency, rather than explicitly modeling the forms in context. All of these approaches are designed to learn from a type-level lexicon, rendering direct comparison difficult.
7 Experiments
While we estimate all the parameters in the generative model, the purpose of this work is to improve the performance of morphological inflectors through semi-supervised learning with the incorporation of unlabeled data.
7.1 Low-Resource Inflection Generation
The development of our method was primarily aimed at the low-resource scenario, where we observe a limited number of annotated data points. Why low-resource? When we have access to a preponderance of data, morphological inflection is close to being a solved problem, as evinced in SIGMORPHON’s 2016 shared task. However, the CoNLL-SIGMORPHON 2017 shared task showed there is much progress to be made in the low-resource case. Semi-supervision is a clear avenue.
7.2 Data
As our model requires token-level morphological annotation, we perform our experiments on the Universal Dependencies (UD) dataset Nivre et al. (2017). As this stands in contrast to most work on morphological inflection (which has used the UniMorph Sylak-Glassman et al. (2015) datasets), we use a converted version of UD data, in which the UD morphological tags have been deterministically converted into UniMorph tags. (Footnote 6: The two annotation schemes are similar. For a discussion, we refer the reader to http://universaldependencies.org/v2/features.html; sadly there are differences that render all numbers reported in this work incomparable with previous work, see § 7.4.)
For each of the treebanks in the UD dataset, we divide the training portion into three chunks consisting of the first 500, 1000 and 5000 tokens, respectively. These labeled chunks will constitute three unique sets . The remaining sentences in the training portion will be used as unlabeled data for each language, i.e., we will discard those labels. The development and test portions will be left untouched.
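The split can be sketched as follows. The text does not say how a token budget that falls inside a sentence is handled, so this sketch assumes that whole sentences are taken until the budget is reached, which may overshoot it slightly. The function name and the `(form, lemma, tag)` token layout are likewise assumptions.

```python
from typing import List, Tuple

Token = Tuple[str, str, str]  # (form, lemma, morphological tag)
Sentence = List[Token]


def split_treebank(sentences: List[Sentence], budget: int):
    """Split training sentences into a labeled prefix and unlabeled remainder.

    Assumption: whole sentences are kept until at least `budget` tokens
    are labeled; labels of all later sentences are discarded.
    """
    labeled, unlabeled, used = [], [], 0
    for sent in sentences:
        if used < budget:
            labeled.append(sent)
            used += len(sent)
        else:
            # Keep only the surface forms: lemmata and tags are dropped.
            unlabeled.append([form for form, _, _ in sent])
    return labeled, unlabeled


# Toy treebank of four three-token sentences (hypothetical annotation).
toy = [
    [("dogs", "dog", "N;PL"), ("talk", "talk", "V;PL"), (".", ".", "PUNCT")]
    for _ in range(4)
]
labeled, unlabeled = split_treebank(toy, budget=5)
print(len(labeled), len(unlabeled))  # 2 2
```

Calling the function with budgets of 500, 1000, and 5000 on the same training portion would yield the three nested labeled sets, each paired with its own unlabeled remainder.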
Languages.
We explore a typologically diverse set of languages of various stocks: Indo-European, Afro-Asiatic, Turkic and Finno-Ugric, as well as the language isolate Basque. We have organized our experimental languages in Tab. 3 by genetic grouping, highlighting sub-families where possible. The Indo-European languages mostly exhibit fusional morphologies of varying degrees of complexity. The Basque, Turkic, and Finno-Ugric languages are agglutinative. Both of the Afro-Asiatic languages, Arabic and Hebrew, are Semitic and have templatic morphology with fusional affixes.
7.3 Evaluation
The end product of our procedure is a morphological inflector, whose performance is to be improved through the incorporation of unlabeled data. Thus, we evaluate using the standard metric accuracy. We will evaluate at the type level, as is traditional in the morphological inflection literature, even though the UD treebanks on which we evaluate are token-level resources. Concretely, we compile an incomplete type-level morphological lexicon from the token-level resource. To create this resource, we gather all unique form-lemma-tag triples present in the UD test data. (Footnote 7: Some of these form-lemma-tag triples will overlap with those seen in the training data.)
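The evaluation protocol amounts to deduplicating the test tokens into triples and scoring exact matches. In the sketch below, `toy_inflect` is a deliberately naive, hypothetical inflector and the test tokens are invented; only the deduplicate-then-score logic reflects the description above.

```python
from typing import Callable, Iterable, Tuple

Token = Tuple[str, str, str]  # (form, lemma, morphological tag)


def type_level_accuracy(
    tokens: Iterable[Token], inflect: Callable[[str, str], str]
) -> float:
    """Exact-match accuracy over the unique (form, lemma, tag) triples."""
    triples = set(tokens)  # repeated tokens count only once
    correct = sum(inflect(lemma, tag) == form for form, lemma, tag in triples)
    return correct / len(triples)


def toy_inflect(lemma: str, tag: str) -> str:
    """Hypothetical rule-based inflector used only to exercise the metric."""
    if tag == "V;PST":
        return lemma + "ed"
    if tag == "V;3;SG":
        return lemma + "s"
    return lemma


test_tokens = [
    ("talked", "talk", "V;PST"),
    ("talked", "talk", "V;PST"),  # duplicate token, same type
    ("ran", "run", "V;PST"),
    ("walks", "walk", "V;3;SG"),
]
accuracy = type_level_accuracy(test_tokens, toy_inflect)
print(round(accuracy, 3))  # 0.667
```

Deduplicating first means that a frequent form such as talked influences the score no more than a rare one, which is the difference between type-level and token-level accuracy.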
7.4 Baselines
As mentioned before, most work on morphological inflection has considered the task of estimating statistical inflectors from type-level lexicons. Here, in contrast, we require token-level annotation to estimate our model. For this reason, there is neither a competing approach with whose numbers we can make a fair comparison nor an open-source system we could easily run in the token-level setting. This is why we treat our token-level data as a list of “types” and then use two simple type-based baselines. (Footnote 8: Typical type-based inflection lexicons are likely not i.i.d. samples from natural utterances, but we have no other choice if we want to make use of only our token-level data and not additional resources like frequency and regularity of forms.)
First, we consider the probabilistic finite-state transducer used as the baseline for the CoNLL-SIGMORPHON 2017 shared task. (Footnote 9: https://sites.google.com/view/conllsigmorphon2017/) We consider this a relatively strong baseline, as we seek to generalize from a minimal amount of data. As described by Cotterell et al. (2017), the baseline performed quite competitively in the task’s low-resource setting. Note that the finite-state machine is created by heuristically extracting prefixes and suffixes from the word forms, based on an unsupervised alignment step. The second baseline is our neural inflector given in § 4 without the semi-supervision; this model is state-of-the-art on the high-resource version of the task.
We will refer to our models as follows: FST is the probabilistic transducer, NN is the neural sequence-to-sequence model without semi-supervision, and SVAE is the structured variational autoencoder, which is equivalent to NN but also trained using wake-sleep and unlabeled data.
7.5 Results
We ran the three models on 23 languages with the hyperparameters and experimental details described in App. A. We present our results in Fig. 2 and in Tab. 3. We also provide sample output of the generative model created using the dream step in App. B. The high-level takeaway is that on almost all languages we are able to exploit the unlabeled data to improve the sequence-to-sequence model, i.e., SVAE outperforms the NN model on all languages across all training scenarios. However, in many cases, the FST model is a better choice: the FST can sometimes generalize better from a handful of supervised examples than the neural network, even with semi-supervision (SVAE). We highlight three finer-grained observations below.
Observation 1: FST Good in Low-Resource.
As clearly evinced in Fig. 2, the baseline FST is still competitive with the NN, or even our SVAE, when data is extremely scarce. Our neural architecture is quite general, and lacks the prior knowledge and inductive biases of the rule-based system, which become more pertinent in low-resource scenarios. Even though our semi-supervised strategy clearly improves the performance of NN, we cannot always recommend SVAE for the case when we only have 500 annotated tokens, although on average it does slightly better. The SVAE surpasses the FST when moving up to 1000 annotated tokens, and the gap becomes even more pronounced at 5000 annotated tokens.
Observation 2: Agglutinative Languages.
The next trend we remark upon is that languages of an agglutinating nature tend to benefit more from the semi-supervised learning. Why should this be? Since in our experimental setup, every language sees the same number of tokens, it is naturally harder to generalize on languages that have more distinct morphological variants. Also, by the nature of agglutinative languages, relevant morphemes could be arbitrarily far from the edges of the string, making the (NN and) SVAE’s ability to learn more generic rules even more valuable.
Observation 3: Non-concatenative Morphology.
One interesting advantage that the neural models have over the FSTs is the ability to learn non-concatenative phenomena. The FST model is based on prefix and suffix rewrite rules and, naturally, struggles when the correctly reinflected form is more than the concatenation of these parts. Thus we see that for the two Semitic languages, the SVAE is the best method across all resource settings.
8 Conclusion
We have presented a novel generative model for morphological inflection generation in context. The model allows us to exploit unlabeled data in the training of morphological inflectors. As the model’s rich parameterization prevents tractable inference, we craft a variational inference procedure, based on the wake-sleep algorithm, to marginalize out the latent variables. Experimentally, we provide empirical validation on 23 languages. We find that, especially in the lower-resource conditions, our model improves by large margins over the baselines.
References
 Ahlberg et al. (2015) Malin Ahlberg, Markus Forsberg, and Mans Hulden. 2015. Paradigm classification in supervised learning of morphology. In Human Language Technologies: The 2015 Annual Conference of the North American Chapter of the ACL, pages 1024–1029, Denver, CO. Association for Computational Linguistics.
 Ammar et al. (2014) Waleed Ammar, Chris Dyer, and Noah A. Smith. 2014. Conditional random field autoencoders for unsupervised structured prediction. In Advances in Neural Information Processing Systems, pages 3311–3319.
 Aronoff (1976) Mark Aronoff. 1976. Word Formation in Generative Grammar. Number 1 in Linguistic Inquiry Monographs. MIT Press, Cambridge, MA.

 Beal (2003) Matthew James Beal. 2003. Variational Algorithms for Approximate Bayesian Inference. University College London.
 Bojanowski et al. (2016) Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2016. Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606.
 Bornschein and Bengio (2014) Jörg Bornschein and Yoshua Bengio. 2014. Reweighted wake-sleep. CoRR, abs/1406.2751.
 Cotterell and Heigold (2017) Ryan Cotterell and Georg Heigold. 2017. Cross-lingual character-level neural morphological tagging. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 748–759, Copenhagen, Denmark. Association for Computational Linguistics.
 Cotterell et al. (2017) Ryan Cotterell, Christo Kirov, John Sylak-Glassman, Géraldine Walther, Ekaterina Vylomova, Patrick Xia, Manaal Faruqui, Sandra Kübler, David Yarowsky, Jason Eisner, and Mans Hulden. 2017. The CoNLL-SIGMORPHON 2017 shared task: Universal morphological reinflection in 52 languages. In Proceedings of the CoNLL-SIGMORPHON 2017 Shared Task: Universal Morphological Reinflection, Vancouver, Canada. Association for Computational Linguistics.
 Cotterell et al. (2016) Ryan Cotterell, Christo Kirov, John Sylak-Glassman, David Yarowsky, Jason Eisner, and Mans Hulden. 2016. The SIGMORPHON 2016 shared task: morphological reinflection. In Proceedings of the 14th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 10–22, Berlin, Germany. Association for Computational Linguistics.
 Dayan et al. (1995) Peter Dayan, Geoffrey E. Hinton, Radford M. Neal, and Richard S. Zemel. 1995. The Helmholtz machine. Neural Computation, 7(5):889–904.
 Dempster et al. (1977) A. P. Dempster, N. M. Laird, and D. B. Rubin. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological), 39(1):1–38.
 Dreyer et al. (2008) Markus Dreyer, Jason Smith, and Jason Eisner. 2008. Latent-variable modeling of string transductions with finite-state methods. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1080–1089, Honolulu, Hawaii. Association for Computational Linguistics.
 Dryer et al. (2005) Matthew S. Dryer, David Gil, Bernard Comrie, Hagen Jung, Claudia Schmidt, et al. 2005. The world atlas of language structures.
 Durrett and DeNero (2013) Greg Durrett and John DeNero. 2013. Supervised learning of complete morphological paradigms. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1185–1195, Atlanta, Georgia. Association for Computational Linguistics.
 Faruqui et al. (2016) Manaal Faruqui, Yulia Tsvetkov, Graham Neubig, and Chris Dyer. 2016. Morphological inflection generation using character sequence to sequence learning. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 634–643, San Diego, California. Association for Computational Linguistics.
 Fine et al. (1998) Shai Fine, Yoram Singer, and Naftali Tishby. 1998. The hierarchical hidden Markov model: Analysis and applications. Machine Learning, 32(1):41–62.
 Heigold et al. (2017) Georg Heigold, Guenter Neumann, and Josef van Genabith. 2017. An extensive empirical evaluation of character-based morphological tagging for 14 languages. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 505–513, Valencia, Spain. Association for Computational Linguistics.
 Hinton et al. (1995) Geoffrey E. Hinton, Peter Dayan, Brendan J. Frey, and Radford M. Neal. 1995. The "wake-sleep" algorithm for unsupervised neural networks. Science, 268(5214):1158–1161.
 Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.
 Hulden et al. (2014) Mans Hulden, Markus Forsberg, and Malin Ahlberg. 2014. Semi-supervised learning of morphological paradigms and lexicons. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 569–578, Gothenburg, Sweden. Association for Computational Linguistics.
 Jordan et al. (1999) Michael I. Jordan, Zoubin Ghahramani, Tommi S. Jaakkola, and Lawrence K. Saul. 1999. An introduction to variational methods for graphical models. Machine Learning, 37(2):183–233.
 Kann and Schütze (2016) Katharina Kann and Hinrich Schütze. 2016. Single-model encoder-decoder with explicit morphological representation for reinflection. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL), pages 555–560, Berlin, Germany. Association for Computational Linguistics.
 Kibrik (1998) Aleksandr E. Kibrik. 1998. Archi. In Andrew Spencer and Arnold M. Zwicky, editors, The Handbook of Morphology, pages 455–476.
 Kingma and Welling (2013) Diederik P. Kingma and Max Welling. 2013. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
 Lafferty et al. (2001) John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning, pages 282–289.
 Luong et al. (2015) Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412–1421, Lisbon, Portugal. Association for Computational Linguistics.
 Müller et al. (2015) Thomas Müller, Ryan Cotterell, Alexander Fraser, and Hinrich Schütze. 2015. Joint lemmatization and morphological tagging with Lemming. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 2268–2274, Lisbon, Portugal. Association for Computational Linguistics.
 Müller et al. (2013) Thomas Müller, Helmut Schmid, and Hinrich Schütze. 2013. Efficient higher-order CRFs for morphological tagging. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 322–332, Seattle, Washington, USA. Association for Computational Linguistics.
 Nicolai et al. (2015) Garrett Nicolai, Colin Cherry, and Grzegorz Kondrak. 2015. Inflection generation as discriminative string transduction. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 922–931, Denver, Colorado. Association for Computational Linguistics.

 Nivre et al. (2017) Joakim Nivre, Željko Agić, Lars Ahrenberg, Maria Jesus Aranzabe, Masayuki Asahara, Aitziber Atutxa, Miguel Ballesteros, John Bauer, Kepa Bengoetxea, Riyaz Ahmad Bhat, Eckhard Bick, Cristina Bosco, Gosse Bouma, Sam Bowman, Marie Candito, Gülşen Cebiroğlu Eryiğit, Giuseppe G. A. Celano, Fabricio Chalub, Jinho Choi, Çağrı Çöltekin, Miriam Connor, Elizabeth Davidson, Marie-Catherine de Marneffe, Valeria de Paiva, Arantza Diaz de Ilarraza, and Dobrovoljc. 2017. Universal dependencies 2.0. LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University.
 Rabiner (1989) Lawrence R. Rabiner. 1989. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286.
 Rumelhart et al. (1986) David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. 1986. Learning internal representations by error propagation. Technical report, California Univ San Diego La Jolla Inst for Cognitive Science.
 Spencer (1991) Andrew Spencer. 1991. Morphological Theory: An Introduction to Word Structure in Generative Grammar. WileyBlackwell.
 Stuhlmüller et al. (2013) Andreas Stuhlmüller, Jacob Taylor, and Noah Goodman. 2013. Learning stochastic inverses. In Advances in Neural Information Processing Systems, pages 3048–3056.
 Sundermeyer et al. (2012) Martin Sundermeyer, Ralf Schlüter, and Hermann Ney. 2012. LSTM neural networks for language modeling. In Thirteenth Annual Conference of the International Speech Communication Association.
 Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13, 2014, Montreal, Quebec, Canada, pages 3104–3112.
 Sylak-Glassman (2016) John Sylak-Glassman. 2016. The composition and use of the universal morphological feature schema (UniMorph schema). Technical report, Johns Hopkins University.
 Sylak-Glassman et al. (2015) John Sylak-Glassman, Christo Kirov, David Yarowsky, and Roger Que. 2015. A language-independent feature schema for inflectional morphology. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (ACL), pages 674–680, Beijing, China. Association for Computational Linguistics.
 Zeiler (2012) Matthew D. Zeiler. 2012. Adadelta: An adaptive learning rate method. arXiv preprint arXiv:1212.5701.
 Zhou and Neubig (2017) Chunting Zhou and Graham Neubig. 2017. Multi-space variational encoder-decoders for semi-supervised labeled sequence transduction. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 310–320, Vancouver, Canada. Association for Computational Linguistics.
Appendix A Hyperparameters and Experimental Details
Here, we list all the hyperparameters and other experimental details necessary to reproduce the numbers presented in Tab. 3. The final experiments were produced with the following settings. For each component, we performed a modest grid search over various configurations to find the best option on the development set.
LSTM Morphological Tag Language Model.
The morphological tag language model is a layer vanilla LSTM trained with hidden size of . It is trained for epochs using SGD with a cross-entropy loss objective and an initial learning rate of , where the learning rate is quartered during any epoch where the loss on the validation set reaches a new minimum. We regularize using dropout of and clip gradients to . The morphological tags are embedded (both for input and output) with a multi-hot encoding into , where any given tag has an embedding that is the sum of the embeddings of its constituent POS tag and each of its constituent slots.
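The multi-hot tag embedding described above can be illustrated with a minimal sketch. The tag format (parts separated by `;`), the symbol inventory, the dimension of 4, and the helper names `make_embedding_table` and `embed_tag` are all assumptions made for illustration; in a trained model the vectors would be learned parameters rather than random draws.

```python
import random


def make_embedding_table(symbols, dim, seed=0):
    """Build a toy embedding table: one random vector per POS tag or slot value.

    Hypothetical helper; a real model would learn these vectors.
    """
    rng = random.Random(seed)
    return {s: [rng.gauss(0.0, 1.0) for _ in range(dim)] for s in symbols}


def embed_tag(tag, table):
    """Embed a morphological tag as the sum of its constituents' embeddings.

    Assumes the tag is written as 'POS;SLOT1;SLOT2;...' (an assumed format).
    """
    dim = len(next(iter(table.values())))
    total = [0.0] * dim
    for part in tag.split(";"):
        total = [t + v for t, v in zip(total, table[part])]
    return total


# Toy inventory: two POS tags and four slot values (illustrative only).
table = make_embedding_table(["N", "V", "NOM", "ACC", "SG", "PL"], dim=4)
vec = embed_tag("N;NOM;SG", table)
```

Because the embedding is a sum, two tags that share a POS tag or a slot value also share the corresponding component of their vectors, which is what allows rare tags to borrow strength from frequent ones.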
Lemmata Generator.
The lemma generator is a single-layer vanilla LSTM, trained for 10000 epochs using SGD with a learning rate of 4 and a batch size of 20000. The LSTM has 50 hidden units, embeds the POS tags into and each token (i.e., character) into . We regularize using weight decay (1e-6), use no dropout, and clip gradients to 1. When sampling lemmata from the model, we cool the distribution using a temperature of 0.75 to generate more “conservative” values. The hyperparameters were manually tuned on Latin data to produce sensible output and fit the development data, and were then reused for all languages in this paper.
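To make the effect of cooling concrete, the following sketch applies a temperature of 0.75 to a toy next-character distribution. The function names `cool` and `sample` and the example probabilities are hypothetical; the sketch only assumes the standard definition of temperature, in which each probability is raised to the power 1/T and the result is renormalized.

```python
import math
import random


def cool(probs, temperature):
    """Rescale a distribution by a temperature: p_i^(1/T), renormalized.

    With T < 1 the distribution becomes more peaked ("conservative").
    """
    logits = [math.log(p) / temperature if p > 0 else float("-inf") for p in probs]
    top = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]


def sample(probs, rng):
    """Draw one index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Toy distribution over three characters (illustrative values only).
next_char_probs = [0.7, 0.2, 0.1]
cooled = cool(next_char_probs, temperature=0.75)
index = sample(cooled, random.Random(0))
```

After cooling, the most likely character gains probability mass and the least likely one loses it, so sampled lemmata stay closer to the high-probability strings.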
Morphological Inflector.
The reinflection model is a single-layer GRU-cell seq2seq model with a bidirectional encoder and multiplicative attention in the style of Luong et al. (2015), which we train for 250 iterations of AdaDelta (Zeiler, 2012). Our search over the remaining hyperparameters was as follows (optimal values in bold): input embedding size of {50, 100, 200, 300}, hidden size of {50, 100, 150, 200}, and a dropout rate of {0.0, 0.1, 0.2, 0.3, 0.4, 0.5}.
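The search space above contains 4 × 4 × 6 = 96 configurations. A minimal sketch of enumerating it and picking the best configuration on development data follows. The names `GRID`, `configurations`, `grid_search`, and `toy_dev_score` are hypothetical, and the scoring function is a placeholder with no relation to the reported results; in practice it would train the inflector and return its development accuracy.

```python
import itertools

# Candidate values listed in the text.
GRID = {
    "embedding_size": [50, 100, 200, 300],
    "hidden_size": [50, 100, 150, 200],
    "dropout": [0.0, 0.1, 0.2, 0.3, 0.4, 0.5],
}


def configurations(grid):
    """Yield every combination of hyperparameter values as a dict."""
    names = list(grid)
    for values in itertools.product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


def grid_search(grid, dev_score):
    """Return the configuration with the highest development score."""
    return max(configurations(grid), key=dev_score)


def toy_dev_score(config):
    """Placeholder score (an assumption); stands in for training and evaluation."""
    return -abs(config["hidden_size"] - 150) - abs(config["dropout"] - 0.3)


all_configs = list(configurations(GRID))
best = grid_search(GRID, toy_dev_score)
```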
Lemmatizer and Morphological Tagger.
The joint lemmatizer and tagger is Lemming, as described in § 5.5. It is trained with default parameters, the pretrained word vectors from Bojanowski et al. (2016) as type embeddings, and beam size .
Wake-Sleep.
We run two iterations () of wake-sleep. Note that each of the subparts of wake-sleep, estimating and estimating , is trained to convergence and uses the hyperparameters described in the previous paragraphs. We set and to , so we observe roughly as many dreamt samples as true samples. The samples from the generative model often act as a regularizer, helping the variational approximation (as measured by morphological tagging and lemmatization accuracy) on the UD development set, but sometimes the noise lowers performance slightly. Due to a lack of space in the initial paper, we did not deeply examine the performance of the tagger-lemmatizer outside the context of improving inflection prediction accuracy. Future work will investigate the question of how much tagging and lemmatization can be improved through the incorporation of samples from our generative model. In short, our efforts will evaluate the inference network in its own right, rather than just as a variational approximation to the posterior.
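The alternation described above can be outlined schematically. This is a sketch under stated assumptions, not the training code used for the experiments: the function `wake_sleep`, its callable arguments (`fit_generative`, `fit_inference`, `dream`, `infer`), the initialization on labeled data, the order of the two phases, and the default of drawing as many dreamt samples as there are labeled examples are all hypothetical choices made for illustration.

```python
def wake_sleep(labeled, unlabeled, fit_generative, fit_inference,
               dream, infer, iterations=2, n_dreamt=None):
    """Schematic wake-sleep loop with user-supplied training routines.

    labeled:   list of fully annotated examples
    unlabeled: list of raw sentences
    Each fit_* callable is assumed to train its model to convergence.
    """
    # Assumption: match the number of dreamt samples to the labeled data.
    n_dreamt = len(labeled) if n_dreamt is None else n_dreamt

    # Assumption: both models start from the labeled data alone.
    model = fit_generative(labeled)
    guide = fit_inference(labeled)

    for _ in range(iterations):
        # Sleep phase: dream annotated samples from the generative model
        # and refit the inference network on true plus dreamt data.
        dreamt = [dream(model) for _ in range(n_dreamt)]
        guide = fit_inference(labeled + dreamt)

        # Wake phase: annotate raw text with the inference network
        # and refit the generative model on true plus inferred data.
        guessed = [infer(guide, sentence) for sentence in unlabeled]
        model = fit_generative(labeled + guessed)

    return model, guide
```

Passing the training routines in as callables keeps the sketch independent of any particular tagger, lemmatizer, or inflector implementation.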
Appendix B Fake Data from the Sleep Phase
An example sentence sampled via in Portuguese:
dentremeticamente » isso Procusas Da Fase » pos a acordítica Máisringeringe Ditudis A ana , Urevirao Da De O linsith.muital , E que chegou interalionalmente Da anundica De mêpinsuriormentais .
and in Latin:
inpremcret ita sacrum super annum pronditi avocere quo det tuam nunsidebus quod puela ?