Code for EMNLP 2016 paper: Morphological Priors for Probabilistic Word Embeddings
Word embeddings allow natural language processing systems to share statistical information across related words. These embeddings are typically based on distributional statistics, making it difficult for them to generalize to rare or unseen words. We propose to improve word embeddings by incorporating morphological information, capturing shared sub-word features. Unlike previous work that constructs word embeddings directly from morphemes, we combine morphological and distributional information in a unified probabilistic framework, in which the word embedding is a latent variable. The morphological information provides a prior distribution on the latent word embeddings, which in turn condition a likelihood function over an observed corpus. This approach yields improvements on intrinsic word similarity evaluations, and also in the downstream task of part-of-speech tagging.
Word embeddings have been shown to improve many natural language processing applications, from language models [mikolov2010recurrent] to information extraction [collobert2008unified], and from parsing [chen2014fast] to machine translation [cho2014learning]. Word embeddings leverage a classical idea in natural language processing: use distributional statistics from large amounts of unlabeled data to learn representations that allow sharing across related words [brown1992class]. While this approach is undeniably effective, the long-tail nature of linguistic data ensures that there will always be words that are not observed in even the largest corpus [zipf1949human]. There will be many other words which are observed only a handful of times, making the distributional statistics too sparse to accurately estimate the 100- or 1000-dimensional dense vectors that are typically used for word embeddings. These problems are particularly acute in morphologically rich languages like German and Turkish, where each word may have dozens of possible inflections.
Recent work has proposed to address this issue by replacing word-level embeddings with embeddings based on subword units: morphemes [luong2013better, botha2014compositional] or individual characters [santos2014learning, ling2015finding, kim2016character]. Such models leverage the fact that word meaning is often compositional, arising from subword components. By learning representations of subword units, it is possible to generalize to rare and unseen words.
But while morphology and orthography are sometimes a signal of semantics, there are also many cases where similar spellings do not imply similar meanings: better-batter, melon-felon, dessert-desert, etc. If each word’s embedding is constrained to be a deterministic function of its characters, as in prior work, then it will be difficult to learn appropriately distinct embeddings for such pairs. Automated morphological analysis may also be incorrect: for example, really may be segmented into re+ally, incorrectly suggesting a similarity to revise and review. Even correct morphological segmentation may be misleading. Consider that incredible and inflammable share a prefix in-, which exerts the opposite effect in these two cases. (The confusion is resolved by morphologically analyzing the second example as (in+flame)+able, but this requires hierarchical morphological parsing, not just segmentation.) Overall, a word’s observed internal structure gives evidence about its meaning, but it must be possible to override this evidence when the distributional facts point in another direction.
We formalize this idea using the machinery of probabilistic graphical models. We treat word embeddings as latent variables [vilnis2015word], which are conditioned on a prior distribution that is based on word morphology. We then maximize a variational approximation to the expected likelihood of an observed corpus of text, fitting variational parameters over latent binary word embeddings. For common words, the expected word embeddings are largely determined by the expected corpus likelihood, and thus, by the distributional statistics. For rare words, the prior plays a larger role. Since the prior distribution is a function of the morphology, it is possible to impute embeddings for unseen words after training the model.
We model word embeddings as latent binary vectors. This choice is based on linguistic theories of lexical semantics and morphology. Morphemes are viewed as adding morphosyntactic features to words: for example, in English, un- adds a negation feature (unbelievable), -s adds a plural feature, and -ed adds a past tense feature [halle1993distributed]. Similarly, the lexicon is often viewed as organized in terms of features: for example, the word bachelor carries the features human, male, and unmarried [katz1963structure]. Each word’s semantic role within a sentence can also be characterized in terms of binary features [dowty1991thematic, reisinger2015semantic]. Our approach is more amenable to such theoretical models than traditional distributed word embeddings. However, we can also work with the expected word embeddings, which are dense vectors [bengio2013representation].
The modeling framework is illustrated in Figure 1, focusing on the word sesquipedalianism. This word is rare, but its morphology indicates several of its properties: the -ism suffix suggests that the word is a noun, likely describing some abstract property; the sesqui- prefix refers to one and a half, and so on. If the word is unknown, we must lean heavily on these intuitions, but if the word is well attested then we can rely instead on its examples in use.
It is this reasoning that our modeling framework aims to formalize. We treat word embeddings as latent variables in a joint probabilistic model. The prior distribution over a word’s embedding is conditioned on its morphological structure. The embedding itself then participates, as a latent variable, in a neural sequence model over a corpus, contributing to the overall corpus likelihood. If the word appears frequently, then the corpus likelihood dominates the prior — which is equivalent to relying on the word’s distributional properties. If the word appears rarely, then the prior distribution steps in, and gives a best guess as to the word’s meaning.
Before describing these component pieces in detail, we first introduce some notation. The representation of word $w$ is a latent binary vector $\mathbf{b}_w \in \{0,1\}^k$, where $k$ is the size of each word embedding. As noted in the introduction, this binary representation is motivated by feature-based theories of lexical semantics [katz1963structure]. Each word $w$ is constructed from a set of $M_w$ observed morphemes, $\mathcal{M}_w = (m_{w,1}, m_{w,2}, \ldots, m_{w,M_w})$. Each morpheme is in turn drawn from a finite vocabulary of size $v_m$, so that $m_{w,i} \in \{1, 2, \ldots, v_m\}$. Morphemes are obtained from an unsupervised morphological segmenter, which is treated as a black box. Finally, we are given a corpus, which is a sequence of words, $\mathbf{x} = (x_1, x_2, \ldots, x_N)$, where each word $x_t \in \{1, 2, \ldots, v_w\}$, with $v_w$ equal to the size of the vocabulary, including the token for unknown words.
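The key differentiating property of this model is that rather than estimating word embeddings directly, we treat them as a latent variable, with a prior distribution reflecting the word’s morphological properties. To characterize this prior distribution, each morpheme $m$ is associated with an embedding of its own, $\mathbf{u}_m \in \mathbb{R}^k$, where $k$ is again the embedding size. Then for position $i$ of the word embedding $\mathbf{b}_w$, we have the following prior,
$$b_{w,i} \sim \text{Bernoulli}\Big(\sigma\Big(\sum_{m \in \mathcal{M}_w} u_{m,i}\Big)\Big),$$
where $\mathcal{M}_w$ is the set of morphemes of word $w$ and $\sigma(\cdot)$ indicates the sigmoid function. The prior log-likelihood for a set of word embeddings is,
$$\log P(\mathbf{b}; \mathcal{M}, \mathbf{u}) = \sum_{w} \sum_{i=1}^{k} \Big[ b_{w,i} \log \sigma\Big(\sum_{m \in \mathcal{M}_w} u_{m,i}\Big) + (1 - b_{w,i}) \log \Big(1 - \sigma\Big(\sum_{m \in \mathcal{M}_w} u_{m,i}\Big)\Big) \Big].$$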
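As a minimal sketch of this prior (not the released implementation), the following pure-Python code computes the Bernoulli probabilities for one word from the sum of its morpheme embeddings, and the log-probability of a given binary embedding. The segmentation un+believ+able, the three-dimensional toy vectors, and all function names are assumptions made for illustration.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def prior_probs(morphemes, morpheme_emb):
    """P(b_i = 1) per position: sigmoid of the summed morpheme embeddings."""
    k = len(next(iter(morpheme_emb.values())))
    sums = [0.0] * k
    for m in morphemes:
        for i, u in enumerate(morpheme_emb[m]):
            sums[i] += u
    return [sigmoid(s) for s in sums]

def prior_log_likelihood(b, morphemes, morpheme_emb):
    """Log-probability of one binary embedding b under the Bernoulli prior."""
    probs = prior_probs(morphemes, morpheme_emb)
    return sum(math.log(p) if bit else math.log(1.0 - p)
               for bit, p in zip(b, probs))

# Toy morpheme embeddings (made-up values, k = 3).
morpheme_emb = {
    "un": [2.0, -1.0, 0.0],
    "believ": [0.5, 0.5, 0.0],
    "able": [-0.5, 1.5, 0.0],
}
p = prior_probs(["un", "believ", "able"], morpheme_emb)
ll = prior_log_likelihood([1, 1, 0], ["un", "believ", "able"], morpheme_emb)
```

Because the prior depends only on the morphemes, the same function can be called for a word that never occurred in the training corpus, which is how embeddings for unseen words can be imputed.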
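The corpus likelihood is computed via a recurrent neural network language model (RNNLM) [mikolov2010recurrent], which is a generative model of sequences of tokens. In the RNNLM, the probability of each word is conditioned on all preceding words through a recurrently updated state vector. This state vector in turn depends on the embeddings of the previous words, through the following update equations:
$$\mathbf{h}_t = f(\mathbf{b}_{x_t}, \mathbf{h}_{t-1})$$
$$x_{t+1} \sim \text{Multinomial}(\text{Softmax}(\mathbf{V} \mathbf{h}_t)).$$
The function $f(\cdot)$ is a recurrent update equation; in the RNN, it corresponds to $\sigma(\Theta \mathbf{h}_{t-1} + \mathbf{b}_{x_t})$, where $\sigma(\cdot)$ is the elementwise sigmoid function. The matrix $\mathbf{V}$ contains the “output embeddings” of each word in the vocabulary. We can then define the conditional log-likelihood of a corpus $\mathbf{x} = (x_1, x_2, \ldots, x_N)$ as,
$$\log P(\mathbf{x} \mid \mathbf{b}) = \sum_{t=1}^{N} \log P(x_t \mid \mathbf{h}_{t-1}, \mathbf{b}).$$
Since $\mathbf{h}_{t-1}$ is deterministically computed from $x_{1:t-1}$ (conditioned on $\mathbf{b}$), we can equivalently write the log-likelihood as,
$$\log P(\mathbf{x} \mid \mathbf{b}) = \sum_{t=1}^{N} \log P(x_t \mid x_{1:t-1}, \mathbf{b}).$$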
This same notation can be applied to compute the likelihood under a long short-term memory (LSTM) language model [sundermeyer2012lstm]. The only difference is that the recurrence function from Equation 6 now becomes more complex, including the input, output, and forget gates, and the recurrent state now includes the memory cell. As the LSTM update equations are well known, we focus on the more concise RNN notation, but we employ LSTMs in all experiments due to their better ability to capture long-range dependencies.
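The simple RNN version of this likelihood can be sketched in a few lines of pure Python. This is an illustrative toy under stated assumptions, not the LSTM used in the experiments: the initial state is taken to be zero, the first token is treated as given context, and the tiny parameter values and function names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def matvec(M, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def log_softmax(scores):
    """Log of the softmax, computed with the usual max-shift for stability."""
    mx = max(scores)
    log_z = mx + math.log(sum(math.exp(s - mx) for s in scores))
    return [s - log_z for s in scores]

def rnn_step(theta, emb_x, h_prev):
    """Simple RNN update: h_t = sigmoid(Theta h_{t-1} + b_{x_t})."""
    return [sigmoid(a + e) for a, e in zip(matvec(theta, h_prev), emb_x)]

def corpus_log_likelihood(tokens, emb, theta, V):
    """Sum of log P(next token | history); the first token is given context."""
    h = [0.0] * len(theta)  # assumed zero initial state
    total = 0.0
    for prev, nxt in zip(tokens, tokens[1:]):
        h = rnn_step(theta, emb[prev], h)
        total += log_softmax(matvec(V, h))[nxt]
    return total

# Toy setup: 3 word types, embedding size 2 (made-up values).
emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # input embeddings b_w
theta = [[0.5, -0.5], [0.25, 0.75]]           # recurrence parameters
V = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]    # output embeddings
ll = corpus_log_likelihood([0, 1, 2, 0], emb, theta, V)
uniform = corpus_log_likelihood([0, 1, 2, 0], emb, theta, [[0.0, 0.0]] * 3)
```

With all-zero output embeddings every prediction is uniform over the three word types, which gives a quick sanity check on the softmax.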
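Inference on the marginal likelihood $P(\mathbf{x}) = \sum_{\mathbf{b}} P(\mathbf{x} \mid \mathbf{b}) P(\mathbf{b})$ is intractable. We address this issue by making a variational approximation to the posterior over the latent embeddings, $P(\mathbf{b} \mid \mathbf{x}) \approx Q(\mathbf{b})$.
The variational distribution $Q$ is defined using a fully factorized mean field approximation,
$$Q(\mathbf{b}; \gamma) = \prod_{w} \prod_{i=1}^{k} q(b_{w,i}; \gamma_{w,i}).$$
The variational distribution is a product of Bernoullis, with parameters $\gamma_{w,i} \in [0, 1]$. In the evaluations that follow, we use the expected word embeddings $E_Q[\mathbf{b}_w] = \gamma_w$, which are dense vectors in $[0,1]^k$. We can then use $Q$ to place a variational lower bound on the expected conditional likelihood,
$$\log P(\mathbf{x}) = \log \sum_{\mathbf{b}} P(\mathbf{b}) P(\mathbf{x} \mid \mathbf{b}) \geq E_Q[\log P(\mathbf{x} \mid \mathbf{b})] + E_Q[\log P(\mathbf{b})] - E_Q[\log Q(\mathbf{b})].$$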
Even with this variational approximation, the expected log-likelihood is still intractable to compute. In recurrent neural network language models, each word $x_t$ is conditioned on the entire prior history, $x_{1:t-1}$; indeed, this is one of the key advantages over fixed-length $n$-gram models. However, this means that the individual expected log probabilities involve not just the word embedding of $x_t$ and its immediate predecessor, but rather, the embeddings of all words in the sequence $x_{1:t}$:
$$E_Q[\log P(x_t \mid x_{1:t-1}, \mathbf{b})] = \sum_{\{\mathbf{b}_w : w \in x_{1:t}\}} \Big[\prod_{w \in x_{1:t}} Q(\mathbf{b}_w)\Big] \log P(x_t \mid x_{1:t-1}, \mathbf{b}).$$
We therefore make a further approximation by taking a local expectation over the recurrent state,
$$E_Q[\mathbf{h}_t] \approx f(E_Q[\mathbf{b}_{x_t}], E_Q[\mathbf{h}_{t-1}])$$
$$E_Q[\log P(x_{t+1} \mid x_{1:t}, \mathbf{b})] \approx \log \text{Softmax}_{x_{t+1}}(\mathbf{V} E_Q[\mathbf{h}_t]).$$
This approximation means that we do not propagate uncertainty about the recurrent state $\mathbf{h}_t$ through the recurrent update or through the likelihood function, but rather, we use local point estimates. Alternative methods such as variational autoencoders [chung2015recurrent] or sequential Monte Carlo [de2000sequential] might provide better and more principled approximations, but this direction is left for future work.
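To see what a local point estimate gives up, consider a single hidden unit whose pre-activation is a + b, where b is one Bernoulli-distributed embedding bit with mean gamma. The sketch below is a one-dimensional toy with hypothetical function names, not the model itself; it compares the exact expected activation with the plug-in value obtained by pushing the expectation inside the sigmoid.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def exact_expected_activation(a, gamma):
    """E[sigmoid(a + b)] for b ~ Bernoulli(gamma), enumerating b in {0, 1}."""
    return gamma * sigmoid(a + 1.0) + (1.0 - gamma) * sigmoid(a)

def point_estimate_activation(a, gamma):
    """sigmoid(a + E[b]): the expectation is moved inside the nonlinearity."""
    return sigmoid(a + gamma)

exact = exact_expected_activation(0.0, 0.5)
approx = point_estimate_activation(0.0, 0.5)
gap = abs(exact - approx)
```

The two quantities coincide when gamma is exactly 0 or 1, so in this toy setting the point estimate loses the least for bits about which the variational distribution is confident.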
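Variational bounds in the form of Equation 13 can generally be expressed as a difference between an expected log-likelihood term and a term for the Kullback-Leibler (KL) divergence between the prior distribution and the variational distribution [wainwright2008graphical]. Incorporating the approximation in Equation 19, the resulting objective is,
$$\mathcal{L} = \sum_{t=1}^{N} \log P(x_t \mid x_{1:t-1}; E_Q[\mathbf{b}]) - D_{KL}(Q(\mathbf{b}) \,\|\, P(\mathbf{b})).$$
The KL divergence is equal to,
$$D_{KL}(Q(\mathbf{b}) \,\|\, P(\mathbf{b})) = \sum_{w} \sum_{i=1}^{k} \Big[ \gamma_{w,i} \log \frac{\gamma_{w,i}}{\sigma(s_{w,i})} + (1 - \gamma_{w,i}) \log \frac{1 - \gamma_{w,i}}{1 - \sigma(s_{w,i})} \Big],$$
where $\gamma_{w,i}$ is the variational Bernoulli parameter for position $i$ of word $w$, and $s_{w,i} = \sum_{m \in \mathcal{M}_w} u_{m,i}$ is the sum of the embeddings of that word’s morphemes at position $i$.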
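Since both the variational distribution and the prior are products of Bernoullis, the KL term decomposes into a sum of closed-form Bernoulli divergences. A minimal sketch follows; the clipping constant `eps` and the function names are assumptions added for numerical safety and illustration.

```python
import math

def bernoulli_kl(gamma, p, eps=1e-12):
    """KL(Bernoulli(gamma) || Bernoulli(p)), with both arguments clipped."""
    g = min(max(gamma, eps), 1.0 - eps)
    q = min(max(p, eps), 1.0 - eps)
    return g * math.log(g / q) + (1.0 - g) * math.log((1.0 - g) / (1.0 - q))

def embedding_kl(gammas, prior_probs):
    """KL for one word: sum over embedding positions."""
    return sum(bernoulli_kl(g, p) for g, p in zip(gammas, prior_probs))

# Variational parameters that agree with the prior incur no penalty.
no_penalty = embedding_kl([0.3, 0.7], [0.3, 0.7])
# Moving one position away from the prior is penalized.
penalty = embedding_kl([0.9, 0.7], [0.5, 0.7])
```

The penalty is paid once per word type, while the likelihood term grows with the number of occurrences, which is why the prior matters most for rare words.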
Each term in the variational bound can be easily constructed in a computation graph, enabling automatic differentiation and the application of standard stochastic optimization techniques.
The objective function is given by the variational lower bound in § 2.3, using the approximation to the conditional likelihood described in Equation 19. This function is optimized in terms of several parameters:
the morpheme embeddings, $\mathbf{u}$;
the variational parameters on the word embeddings, $\gamma$;
the output word embeddings, $\mathbf{V}$;
the parameter of the recurrence function, $\Theta$.
Each of these parameters is updated via the RMSProp online learning algorithm [tieleman2012rmsprop]. The model and baseline (described below) are implemented in blocks [blocks2015]. In the remainder of the paper, we refer to our model as VarEmbed.
All embeddings are trained on 22 million tokens from the North American News Text (NANT) corpus [graff1995north]. We use an initial vocabulary of 50,000 words, with a special token for words that are not among the 50,000 most common. We then perform downcasing and convert all numeric tokens to a special token. After these steps, the vocabulary size decreases to 48,986. Note that the method can impute word embeddings for out-of-vocabulary words under the prior distribution; however, it is still necessary to decide on a vocabulary size to determine the number of variational parameters and output embeddings to estimate.
Unsupervised morphological segmentation is performed using Morfessor [creutz2002unsupervised], with a maximum of sixteen morphemes per word. This results in a total of 14,000 morphemes, which includes stems for monomorphemic words. We do not rely on any labeled information about morphological structure, although the incorporation of gold morphological analysis is a promising topic for future work.
The LSTM parameters are initialized uniformly in the range . The word embeddings are initialized using pre-trained word2vec embeddings. We train the model for 15 epochs, with an initial learning rate of, a decay of per epoch, and minibatches of size . We clip the norm of the gradients (normalized by minibatch size) at 1, using the default settings in the RMSProp implementation in blocks. These choices are motivated by prior work [zaremba2014recurrent]. After each iteration, we compute the objective function on the development set; when the objective does not improve beyond a small threshold, we halve the learning rate.
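The halving rule can be sketched as follows. This is a hedged illustration only: the function name is hypothetical, the initial rate and threshold in the example call are placeholder values rather than the settings used in the experiments, and the separate per-epoch decay is left out.

```python
def halve_on_plateau(dev_objectives, initial_lr, threshold):
    """Learning rate used in each epoch, given per-epoch development objectives.

    The objective is a lower bound that is maximized, so an epoch counts as an
    improvement only if it beats the best value so far by more than `threshold`.
    """
    lr = initial_lr
    best = float("-inf")
    schedule = []
    for obj in dev_objectives:
        schedule.append(lr)
        if obj - best <= threshold:
            lr *= 0.5  # no sufficient improvement: halve for the next epoch
        best = max(best, obj)
    return schedule

# Placeholder values for illustration.
schedule = halve_on_plateau([-10.0, -8.0, -7.99, -7.0], initial_lr=0.1, threshold=0.05)
```

In this example the third epoch improves the objective by only about 0.01, so the fourth epoch runs at half the initial rate.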
Training takes roughly one hour per iteration using an NVIDIA 670 GTX, which is a commodity graphics processing unit (GPU) for gaming. This is nearly identical to the training time required for our reimplementation of the algorithm of botha2014compositional, described below.
The most comparable approach is that of botha2014compositional. In their work, embeddings are estimated for each morpheme, as well as for each in-vocabulary word. The final embedding for a word is then the sum of these embeddings, e.g.,
$$\overrightarrow{\text{greenhouse}} = \textit{greenhouse} + \textit{green} + \textit{house},$$
where the italicized elements represent learned embeddings.
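The additive composition can be sketched as below. The fallback for out-of-vocabulary words (using the morpheme sum alone) and the toy vectors are assumptions made for illustration; the function name is hypothetical.

```python
def sum_embed(word, morphemes, word_emb, morph_emb):
    """Additive embedding: the word's own vector plus its morpheme vectors.

    If the word has no vector of its own (an assumption about how unseen
    words would be handled), only the morpheme vectors are summed.
    """
    k = len(next(iter(morph_emb.values())))
    total = list(word_emb.get(word, [0.0] * k))
    for m in morphemes:
        total = [t + u for t, u in zip(total, morph_emb[m])]
    return total

# Toy two-dimensional embeddings (made-up values).
word_emb = {"greenhouse": [1.0, 0.0]}
morph_emb = {"green": [0.5, 0.5], "house": [0.0, 1.0], "s": [0.25, 0.0]}

in_vocab = sum_embed("greenhouse", ["green", "house"], word_emb, morph_emb)
unseen = sum_embed("greenhouses", ["green", "house", "s"], word_emb, morph_emb)
```

Unlike a prior, this sum is deterministic: the word-specific vector is the only component that can move a word away from what its morphemes suggest.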
We build a baseline that is closely inspired by this approach, which we call SumEmbed. The key difference is that while botha2014compositional build on the log-bilinear language model [mnih2007three], we use the same LSTM-based architecture as in our own model implementation. This enables our evaluation to focus on the critical difference between the two approaches: the use of latent variables rather than summation to model the word embeddings. As with our method, we used pre-trained word2vec embeddings to initialize the model.
The dominant terms in the overall number of parameters are the (expected) word embeddings themselves. Writing $k$ for the embedding size, $v_w$ for the size of the word vocabulary, and $v_m$ for the size of the morpheme vocabulary, the variational parameters of the input word embeddings, $\gamma$, are of size $k \times v_w$. The output word embeddings are of size $k \times v_w$. The morpheme embeddings are of size $k \times v_m$, with $v_m \ll v_w$. In our main experiments, we set $v_w = 48{,}986$ (see above), $v_m = 14{,}000$, and $k = 128$. After including the character embeddings and the parameters of the recurrent models, the total number of parameters is roughly million. This is identical to the number of parameters in the SumEmbed baseline.
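As a back-of-the-envelope check on the dominant terms, the snippet below counts the embedding parameters only, using the figures reported elsewhere in the text (a vocabulary of 48,986 words, 14,000 morphemes, and 128-dimensional embeddings). It assumes one parameter per embedding dimension and leaves out the recurrent parameters, so it is a rough lower bound rather than the model's full parameter count.

```python
k = 128         # embedding size
v_w = 48_986    # word vocabulary size after preprocessing
v_m = 14_000    # number of morphemes found by the segmenter

variational = k * v_w   # variational parameters of the input word embeddings
output = k * v_w        # output word embeddings
morpheme = k * v_m      # morpheme embeddings

embedding_total = variational + output + morpheme
print(f"{embedding_total:,} embedding parameters")
```

The two word-level tables account for most of the total, while the morpheme table is comparatively small.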
| dataset | subset | word2vec | SumEmbed | VarEmbed |
|---|---|---|---|---|
| Wordsim353 | all words (353) | n/a | 42.9 | 48.8 |
| rare words (rw) | all words (2034) | n/a | 23.0 | 24.0 |
Our evaluation compares the following embeddings:
word2vec: We train the popular word2vec CBOW (continuous bag of words) model [mikolov2013distributed], using the gensim implementation.
SumEmbed: We compare against the baseline described in § 3.3, which can be viewed as a reimplementation of the compositional model of botha2014compositional.
VarEmbed: For our model, we take the expected embeddings $E_Q[\mathbf{b}_w]$, and then pass them through an inverse sigmoid function to obtain values over the entire real line.
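The inverse sigmoid is the logit function. A minimal sketch is given below; the clipping constant `eps`, which keeps probabilities of exactly 0 or 1 from producing infinite values, is an assumption and not a detail taken from the text.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logit(p, eps=1e-6):
    """Inverse sigmoid, with the input clipped away from 0 and 1."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))

def to_real_line(expected_embedding):
    """Map an expected embedding in [0, 1]^k to real-valued coordinates."""
    return [logit(g) for g in expected_embedding]

real_valued = to_real_line([0.5, sigmoid(2.0), 0.0])
```

Probabilities near 0.5 map to values near zero, and the mapping is symmetric around that point, so the transformed vectors are on a scale more comparable to conventional real-valued embeddings.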
Our first evaluation is based on two classical word similarity datasets: Wordsim353 [finkelstein2001placing] and the Stanford “rare words” (rw) dataset [luong2013better]. We report Spearman’s $\rho$, a measure of rank correlation, evaluating on both the entire vocabulary as well as the subset of in-vocabulary words.
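A word similarity evaluation of this kind can be sketched in pure Python: score each word pair with the model, then compute the rank correlation against the human judgments. Scoring pairs by cosine similarity is an assumption here (it is the common choice, but the text does not state it), and the function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for idx in order[i:j + 1]:
            ranks[idx] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed scores."""
    return pearson(average_ranks(x), average_ranks(y))

# Made-up human ratings and model scores for four word pairs.
rho = spearman([1.0, 2.0, 3.0, 4.0], [0.1, 0.6, 0.4, 0.9])
```

Because only the ranks matter, the model's similarity scores do not need to be on the same scale as the human ratings.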
As shown in Table 1, VarEmbed consistently outperforms SumEmbed on both datasets. On the Wordsim353 words that are in the NANT vocabulary, word2vec gives slightly better results, but it is not applicable to the complete dataset. On the rare words dataset, word2vec performs considerably worse than both morphology-based models, matching the findings of luong2013better and botha2014compositional regarding the importance of morphology for doing well on this dataset.
Recent work questions whether these word similarity metrics are predictive of performance on downstream tasks [faruqui2016problems]. The qvec statistic is another intrinsic evaluation method, which has been shown to be better correlated with downstream tasks [tsvetkov2015evaluation]. This metric measures the alignment between word embeddings and a set of lexical semantic features. Specifically, we use the semcor noun verb supersenses oracle provided at the qvec github repository (https://github.com/ytsvetko/qvec).
Table 2 (qvec evaluation). Columns: all words (4199); in vocab (3997).
As shown in Table 2, VarEmbed outperforms SumEmbed on the full lexicon, and gives similar performance to word2vec on the set of in-vocabulary words. We also consider the morpheme embeddings alone. For SumEmbed, this means that we construct the word embedding from the sum of the embeddings for its morphemes, without the additional embedding per word. For VarEmbed, we use the expected embedding under the prior distribution. The results for these representations are shown in the bottom half of Table 2, revealing that VarEmbed learns much more meaningful embeddings at the morpheme level, while much of the power of SumEmbed seems to come from the word embeddings.
Our final evaluation is on the downstream task of part-of-speech tagging, using the Penn Treebank. We build a simple classification-based tagger, using a feedforward neural network. (This is not intended as an alternative to state-of-the-art tagging algorithms, but as a comparison of the syntactic utility of the information encoded in the word embeddings.) The inputs to the network are the concatenated embeddings of the five-word neighborhood $(x_{t-2}, x_{t-1}, x_t, x_{t+1}, x_{t+2})$; as in all evaluations, 128-dimensional embeddings are used, so the total size of the input is 640. This input is fed into a network with two hidden layers of size , and a softmax output layer over all tags. We train using RMSProp [tieleman2012rmsprop].
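The tagger's input can be sketched as a simple window concatenation. How sentence boundaries are handled is not stated, so the padding token used here is an assumption, as are the toy embedding values and the function name.

```python
def window_features(sentence, t, emb, pad, width=2):
    """Concatenate the embeddings of the tokens at positions t-width .. t+width.

    Positions that fall outside the sentence use the embedding of `pad`
    (an assumed convention for sentence boundaries).
    """
    feats = []
    for i in range(t - width, t + width + 1):
        tok = sentence[i] if 0 <= i < len(sentence) else pad
        feats.extend(emb[tok])
    return feats

# Toy 128-dimensional embeddings with constant, made-up values.
emb = {
    "the": [0.1] * 128,
    "dog": [0.2] * 128,
    "barks": [0.3] * 128,
    "<pad>": [0.0] * 128,
}
feats = window_features(["the", "dog", "barks"], 0, emb, "<pad>")
```

With a window of five tokens and 128 dimensions per token, the resulting vector has the 640 entries mentioned above.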
Results are shown in Table 3. Both morphologically-informed embeddings are significantly better than word2vec (, two-tailed binomial test), but the difference between SumEmbed and VarEmbed is not significant at . Figure 2 breaks down the errors by word frequency. As shown in the figure, the tagger based on word2vec performs poorly for rare words, which is expected because these embeddings are estimated from sparse distributional statistics. SumEmbed is slightly better on the rarest words (the group accounts for roughly 10% of all tokens). In this case, it appears that this simple additive model is better, since the distributional statistics are too sparse to offer much improvement. The probabilistic VarEmbed embeddings are best for all other frequency groups, showing that the model effectively combines morphology and distributional statistics.
An alternative approach to incorporating additional information into word embeddings is to constrain the embeddings of semantically-related words to be similar. Such work typically draws on existing lexical semantic resources such as WordNet. For example, yu2014improving define a joint training objective, in which the word embedding must predict not only neighboring word tokens in a corpus, but also related word types in a semantic resource; a similar approach is taken by bian2014knowledge. Alternatively, faruqui2015retrofitting propose to “retrofit” pre-trained word embeddings over a semantic network. Both retrofitting and our own approach treat the true word embeddings as latent variables, from which the pre-trained word embeddings are stochastically emitted. However, a key difference from our approach is that the underlying representation in these prior works is relational, and not generative. These methods can capture similarity between words in a relational lexicon such as WordNet, but they do not offer a generative account of how (approximate) meaning is constructed from orthography or morphology.
The SumEmbed baseline is based on the work of botha2014compositional, in which words are segmented into morphemes using Morfessor [creutz2002unsupervised], and then word representations are computed through addition of morpheme representations. A key modeling difference from this prior work is that rather than computing word embeddings directly and deterministically from subcomponent embeddings (morphemes or characters, as in [ling2015finding, kim2016character]), we use these subcomponents to define a prior distribution, which can be overridden by distributional statistics for common words. Other work exploits morphology by training word embeddings to optimize a joint objective over distributional statistics and rich, morphologically-augmented part of speech tags [cotterell2015morphological]. This can yield better word embeddings, but does not provide a way to compute embeddings for unseen words, as our approach does.
Recent work by cotterell2016morphological extends the idea of retrofitting, which was based on semantic similarity, to a morphological framework. In this model, embeddings are learned for morphemes as well as for words, and each word embedding is conditioned on the sum of the morpheme embeddings, using a multivariate Gaussian. The covariance of this Gaussian prior is set to the inverse of the number of examples in the training corpus, which has the effect of letting the morphology play a larger role for rare or unseen words. Like all retrofitting approaches, this method is applied in a pipeline fashion after training word embeddings on a large corpus; in contrast, our approach is a joint model over the morphology and corpus. Another practical difference is that cotterell2016morphological use gold morphological features, while we use an automated morphological segmentation.
Word embeddings are typically treated as a parameter, and are optimized through point estimation [bengio2003neural, collobert2008unified, mikolov2010recurrent]. Current models use word embeddings with hundreds or even thousands of parameters per word, yet many words are observed only a handful of times. It is therefore natural to consider whether it might be beneficial to model uncertainty over word embeddings. vilnis2015word propose to model Gaussian densities over dense vector word embeddings. They estimate the parameters of the Gaussian directly, and, unlike our work, do not consider using orthographic information as a prior distribution. This is easy to do in the latent binary framework proposed here, which is also a better fit for some theoretical models of lexical semantics [katz1963structure, reisinger2015semantic]. This view is shared by kruszewski2015deriving, who induce binary word representations using labeled data of lexical semantic entailment relations, and by henderson2016vector, who take a mean field approximation over binary representations of lexical semantic features to induce hyponymy relations.
More broadly, our work is inspired by recent efforts to combine directed graphical models with discriminatively trained “deep learning” architectures. The variational autoencoder [kingma2013auto], neural variational inference [mnih2014neural, miao2016neural], and black box variational inference [ranganath2014black] all propose to use a neural network to compute the variational approximation. These ideas are employed by chung2015recurrent in the variational recurrent neural network, which places a latent continuous variable at each time step. In contrast, we have a dictionary of latent variables (the word embeddings), which introduce uncertainty over the hidden state in a standard recurrent neural network or LSTM. We train this model by employing a mean field approximation, but these more recent techniques for neural variational inference may also be applicable. We plan to explore this possibility in future work.
We present a model that unifies compositional and distributional perspectives on lexical semantics, through the machinery of Bayesian latent variable models. In this framework, our prior expectations of word meaning are based on internal structure, but these expectations can be overridden by distributional statistics. The model is based on the very successful long short-term memory (LSTM) for sequence modeling, and while it employs a Bayesian justification, its inference and estimation are little more complicated than a standard LSTM. This demonstrates the advantages of reasoning about uncertainty even when working in a “neural” paradigm.
This work represents a first step, and we see many possibilities for improving performance by extending it. Clearly we would expect this model to be more effective in languages with richer morphological structure than English, and we plan to explore this possibility in future work. From a modeling perspective, our prior distribution merely sums the morpheme embeddings, but a more accurate model might account for sequential or combinatorial structure, through a recurrent [ling2015finding], recursive [luong2013better], or convolutional architecture [kim2016character]. There appears to be no technical obstacle to imposing such structure in the prior distribution. Furthermore, while we build the prior distribution from morphemes, it is natural to ask whether characters might be a better underlying representation: character-based models may generalize well to non-word tokens such as names and abbreviations, they do not require morphological segmentation, and they require a much smaller number of underlying embeddings. On the other hand, morphemes encode rich regularities across words, which may make a morphologically-informed prior easier to learn than a prior which works directly at the character level. It is possible that this tradeoff could be transcended by combining characters and morphemes in a single model.
Another advantage of latent variable models is that they admit partial supervision. If we follow tsvetkov2015evaluation in the argument that word embeddings should correspond to lexical semantic features, then an inventory of such features could be used as a source of partial supervision, thus locking dimensions of the word embeddings to specific semantic properties. This would complement the graph-based “retrofitting” supervision proposed by faruqui2015retrofitting, by instead placing supervision at the level of individual words.
Thanks to Erica Briscoe, Martin Hyatt, Yangfeng Ji, Bryan Leslie Lee, and Yi Yang for helpful discussion of this work. Thanks also to the EMNLP reviewers for constructive feedback. This research is supported by the Defense Threat Reduction Agency under award HDTRA1-15-1-0019.