Word embeddings have been shown to be useful in various NLP tasks such as sentiment analysis, topic models, script learning, machine translation, sequence labeling and parsing (Socher et al., 2013; Sutskever et al., 2014; Modi and Titov, 2014; Nguyen et al., 2015a, b; Modi, 2016; Ma and Hovy, 2016; Nguyen et al., 2017; Modi et al., 2017). A word embedding captures the syntactic and semantic properties of a word by representing the word in the form of a real-valued vector (Mikolov et al., 2013a, b; Pennington et al., 2014; Levy and Goldberg, 2014).
However, word embedding models usually do not take lexical ambiguity into account. For example, the word "bank" is usually represented by a single vector for all of its senses, including sloping land and financial institution. Recently, approaches have been proposed to learn multi-sense word embeddings, where each sense of a word corresponds to a sense-specific embedding. Reisinger and Mooney (2010), Huang et al. (2012) and Wu and Giles (2015) proposed methods that cluster the contexts of each word and then use the cluster centroids as vector representations for word senses. Neelakantan et al. (2014), Tian et al. (2014), Li and Jurafsky (2015) and Chen et al. (2015) extended the Word2Vec models (Mikolov et al., 2013a, b) to learn a vector representation for each sense of a word. Chen et al. (2014), Iacobacci et al. (2015) and Flekova and Gurevych (2016) performed word sense induction using external resources (e.g., WordNet, BabelNet) and then learned sense embeddings using the Word2Vec models. Rothe and Schütze (2015) and Pilehvar and Collier (2016) presented methods that use pre-trained word embeddings to learn embeddings for WordNet synsets. Cheng et al. (2015), Liu et al. (2015b), Liu et al. (2015a) and Zhang and Zhong (2016) directly use the Word2Vec Skip-gram model (Mikolov et al., 2013b) to learn the embeddings of words and topics on a topic-assigned corpus.
One issue in these previous works is that they assign the same weight to every sense of a word. The central assumption of our work is that each sense of a word, given a context, should correspond to a mixture of weights reflecting the different degrees of association of the word with multiple senses in that context. These mixture weights help to model word meaning better.
In this paper, we propose a new model for learning Multi-Sense Word Embeddings (mswe). Our mswe model learns vector representations of a word based on a mixture of its sense representations. The key difference between mswe and other models is that we induce the weights of senses while jointly learning the word and sense embeddings. Specifically, we train a topic model (Blei et al., 2003) to obtain the topic-to-word and document-to-topic probability distributions, which are then used to infer the weights of topics. We use these weights to define a compositional vector representation for each target word to predict its context words. mswe thus differs from the topic-based models (Cheng et al., 2015; Liu et al., 2015b, a; Zhang and Zhong, 2016) in that we do not use the topic assignments when jointly learning vector representations of words and topics. Here we not only learn vectors based on the most suitable topic of a word given its context, but we also take into consideration all possible meanings of the word.
The main contributions of our study are: (i) We introduce a mixture model for learning word and sense embeddings (mswe) by inducing mixture weights of word senses. (ii) We show that mswe performs better than the baseline Word2Vec Skip-gram and other embedding models on the word analogy task (Mikolov et al., 2013a) and the word similarity task (Reisinger and Mooney, 2010).
2 The mixture model
In this section, we present the mixture model for learning multi-sense word embeddings. Here we treat topics as senses. The model learns a representation for each word using a mixture of its topical representations.
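Given a number of topics $T$ and a corpus $D$ of documents $d = \{w_{d,1}, w_{d,2}, \ldots, w_{d,M_d}\}$, we apply a topic model (Blei et al., 2003) to obtain the topic-to-word $\Pr(w \mid t)$ and document-to-topic $\Pr(t \mid d)$ probability distributions. We then infer a weight for the $i$-th word $w_{d,i}$ with topic $t$ in document $d$:

$$\lambda_{d,i,t} = \Pr(w_{d,i} \mid t) \times \Pr(t \mid d) \qquad (1)$$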
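The weight inference described above can be sketched in a few lines, under the assumption that the weight of a word with a topic in a document is the product of the topic-to-word and document-to-topic probabilities. The names `topic_word`, `doc_topic`, `mixture_weight` and `mixture_weights` are hypothetical, and the toy probabilities are invented for illustration rather than taken from a trained topic model.

```python
# Toy distributions (illustrative values only, not output of a real topic model).
# topic_word[t][w] plays the role of Pr(w | t); doc_topic[d][t] of Pr(t | d).
topic_word = [
    {"bank": 0.20, "river": 0.30, "money": 0.01},  # topic 0
    {"bank": 0.25, "river": 0.01, "money": 0.40},  # topic 1
]
doc_topic = [
    [0.9, 0.1],  # document 0 is mostly about topic 0
    [0.2, 0.8],  # document 1 is mostly about topic 1
]


def mixture_weight(d, w, t):
    """Assumed weight of word w with topic t in document d: Pr(w | t) * Pr(t | d)."""
    return topic_word[t].get(w, 0.0) * doc_topic[d][t]


def mixture_weights(d, w):
    """Weights of word w for every topic in document d."""
    return [mixture_weight(d, w, t) for t in range(len(topic_word))]


weights = mixture_weights(0, "bank")
```

In this toy setting the word "bank" receives a much larger weight for topic 0 in document 0 than for topic 1, because the document itself is dominated by topic 0.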
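We define two mswe variants: mswe-1 learns vectors for words based on the most suitable topic given document $d$, while mswe-2 marginalizes over all senses of a word to take into account all possible senses of the word:

$$\text{mswe-1:} \quad \boldsymbol{s}_{d,i} = \frac{\boldsymbol{v}_{w_{d,i}} + \lambda_{d,i,t'} \times \boldsymbol{v}_{t'}}{1 + \lambda_{d,i,t'}}$$

$$\text{mswe-2:} \quad \boldsymbol{s}_{d,i} = \frac{\boldsymbol{v}_{w_{d,i}} + \sum_{t=1}^{T} \lambda_{d,i,t} \times \boldsymbol{v}_{t}}{1 + \sum_{t=1}^{T} \lambda_{d,i,t}}$$

where $\boldsymbol{s}_{d,i}$ is the compositional vector representation of the $i$-th word $w_{d,i}$ and the topics in document $d$; $\boldsymbol{v}_{w}$ is the target vector representation of a word type $w$ in vocabulary $V$; $\boldsymbol{v}_{t}$ is the vector representation of topic $t$; $T$ is the number of topics; $\lambda_{d,i,t}$ is defined as in Equation 1, and in mswe-1 we define $t' = \arg\max_{t} \lambda_{d,i,t}$.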
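The two variants can be illustrated with a small sketch. It assumes the compositional vector is the word vector plus the weighted topic vectors, normalized by one plus the sum of the weights used; the function name `compose` and its arguments are hypothetical, and plain Python lists stand in for embedding vectors.

```python
def compose(v_word, topic_vecs, weights, variant="mswe-2"):
    """Combine a word vector with topic vectors using mixture weights.

    mswe-1 keeps only the topic with the largest weight;
    mswe-2 sums over all topics.
    """
    if variant == "mswe-1":
        best = max(range(len(weights)), key=weights.__getitem__)
        pairs = [(weights[best], topic_vecs[best])]
    else:
        pairs = list(zip(weights, topic_vecs))
    # Normalizer: 1 (for the word vector itself) plus the weights in use.
    total = 1.0 + sum(lam for lam, _ in pairs)
    return [
        (v_word[i] + sum(lam * v_t[i] for lam, v_t in pairs)) / total
        for i in range(len(v_word))
    ]


v_word = [1.0, 0.0]
topic_vecs = [[0.0, 1.0], [1.0, 1.0]]
s1 = compose(v_word, topic_vecs, [0.5, 0.25], variant="mswe-1")
s2 = compose(v_word, topic_vecs, [0.5, 0.25], variant="mswe-2")
```

With all weights equal to zero, `compose` returns the word vector unchanged, which is the setting in which the model reduces to plain Skip-gram.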
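We learn representations by minimizing the following negative log-likelihood function:

$$\mathcal{L} = -\sum_{d \in D} \sum_{i=1}^{M_d} \sum_{\substack{-k \le j \le k \\ j \ne 0}} \log \Pr(\tilde{\boldsymbol{v}}_{w_{d,i+j}} \mid \boldsymbol{s}_{d,i}) \qquad (2)$$

where the $i$-th word $w_{d,i}$ in document $d$ is a target word while the $(i+j)$-th word $w_{d,i+j}$ in document $d$ is a context word of $w_{d,i}$ and $k$ is the context size. In addition, $\tilde{\boldsymbol{v}}_{w}$ is the context vector representation of the word type $w$. The probability $\Pr(\tilde{\boldsymbol{v}}_{c} \mid \boldsymbol{s}_{d,i})$ is defined using the softmax function as follows:

$$\Pr(\tilde{\boldsymbol{v}}_{c} \mid \boldsymbol{s}_{d,i}) = \frac{\exp(\tilde{\boldsymbol{v}}_{c}^{\top} \boldsymbol{s}_{d,i})}{\sum_{c' \in V} \exp(\tilde{\boldsymbol{v}}_{c'}^{\top} \boldsymbol{s}_{d,i})}$$

Since normalizing over the whole vocabulary is expensive, we approximate the log-probability with the negative-sampling objective (Mikolov et al., 2013b):

$$\log \Pr(\tilde{\boldsymbol{v}}_{c} \mid \boldsymbol{s}_{d,i}) \approx \log \sigma(\tilde{\boldsymbol{v}}_{c}^{\top} \boldsymbol{s}_{d,i}) + \sum_{n=1}^{K} \log \sigma(-\tilde{\boldsymbol{v}}_{c_n}^{\top} \boldsymbol{s}_{d,i}) \qquad (3)$$

where $\sigma$ is the sigmoid function, $K$ is the number of negative samples, and each word $c_n$ is sampled from a noise distribution. (We use a unigram distribution raised to the 3/4 power (Mikolov et al., 2013b) as the noise distribution.) In fact, mswe can be viewed as a generalization of the well-known Word2Vec Skip-gram model with negative sampling (Mikolov et al., 2013b), which is recovered when all the mixture weights $\lambda_{d,i,t}$ are set to zero. The models are trained using Stochastic Gradient Descent (SGD).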
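As a rough illustration of the training signal, the sketch below builds a unigram noise distribution raised to the 3/4 power and evaluates a negative-sampling log-probability for one (target, context) pair. It follows the standard Skip-gram negative-sampling formulation; all function names are hypothetical, and gradients and SGD updates are deliberately left out.

```python
import math
import random
from collections import Counter


def noise_distribution(tokens, power=0.75):
    """Unigram distribution raised to the given power, renormalized to sum to 1."""
    counts = Counter(tokens)
    total = sum(c ** power for c in counts.values())
    return {w: (c ** power) / total for w, c in counts.items()}


def sample_negatives(noise, k, rng=random):
    """Draw k negative words (with replacement) from the noise distribution."""
    words = list(noise)
    return rng.choices(words, weights=[noise[w] for w in words], k=k)


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def neg_sampling_log_prob(s_target, v_context, v_negatives):
    """log sigma(v_c . s) + sum over negatives of log sigma(-v_neg . s)."""
    return log_sigmoid(dot(v_context, s_target)) + sum(
        log_sigmoid(-dot(v_neg, s_target)) for v_neg in v_negatives
    )


noise = noise_distribution(["a"] * 16 + ["b"])
negatives = sample_negatives(noise, 5)
```

Raising counts to the 3/4 power flattens the distribution: in the toy corpus the word "a" is 16 times more frequent than "b" but only 8 times more likely to be drawn as a negative sample.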
We evaluate mswe on two different tasks: word similarity and word analogy. We also provide experimental results obtained by the baseline Word2Vec Skip-gram model and other previous works.
Note that not all previous results are included in this paper for comparison, because the training corpora used in most previous work are much larger than ours (Baroni et al., 2014; Li and Jurafsky, 2015; Schwartz et al., 2015; Levy et al., 2015). There are also differences in the pre-processing steps that could affect the results. We could improve our results by using a larger training corpus, but this is not the central point of our paper. Our objective is to show that the embeddings of topics and words can be combined into a single mixture model, leading to good improvements as established empirically.
3.1 Experimental Setup
Following Huang et al. (2012) and Neelakantan et al. (2014), we use the Westbury Lab Wikipedia corpus (Shaoul and Westbury, 2010), containing over 2M articles with about 990M words, for training. In the preprocessing step, texts are lowercased and tokenized, numbers are mapped to 0, and punctuation marks are removed. We extract a vocabulary of the 200,000 most frequent words from the pre-processed corpus. Words not occurring in the vocabulary are mapped to a special token unk, and we use the embedding of unk for unknown words in the benchmark datasets.
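The preprocessing steps listed above could look roughly like the following sketch. The regular expressions, the whitespace tokenization and the helper names (`preprocess`, `build_vocab`, `map_unknown`) are assumptions made for illustration; the exact tokenizer and number-matching rules used for the experiments are not specified in the text.

```python
import re
from collections import Counter

UNK = "unk"  # special token for out-of-vocabulary words


def preprocess(text):
    """Lowercase, map numbers to 0, remove punctuation, split on whitespace."""
    text = text.lower()
    text = re.sub(r"\d+(?:[.,]\d+)*", "0", text)  # 2010, 3.5, 1,000 -> 0
    text = re.sub(r"[^\w\s]", " ", text)          # drop punctuation marks
    return text.split()


def build_vocab(tokens, size):
    """Keep the `size` most frequent words."""
    return {w for w, _ in Counter(tokens).most_common(size)}


def map_unknown(tokens, vocab):
    """Replace every word outside the vocabulary with the unk token."""
    return [w if w in vocab else UNK for w in tokens]


tokens = preprocess("The Bank paid 3.5% in 2010!")
```

For the setup described here, `build_vocab` would be called with `size=200000` on the full pre-processed corpus.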
We first use a small subset extracted from the ws353 dataset (Finkelstein et al., 2002) to tune the hyper-parameters of the baseline Word2Vec Skip-gram model for the word similarity task (see Section 3.2 for the task definition). We then directly use the tuned hyper-parameters for our mswe variants. Vector size is also a hyper-parameter. While some approaches use a higher number of dimensions to obtain better results, we fix the vector size to 300, as used by the baseline, for a fair comparison. The vanilla Latent Dirichlet Allocation (LDA) topic model (Blei et al., 2003) does not scale to a very large corpus, so we explore faster online topic models developed for large corpora. We train the online LDA topic model (Hoffman et al., 2010) on the training corpus, and use the output of this topic model to compute the mixture weights as in Equation 1. (We use default parameters in gensim (Řehůřek and Sojka, 2010) for the online LDA model.) We also use the same ws353 subset to tune the number of topics $T$. We find that the most suitable numbers are $T = 50$ and $T = 200$, which are then used for all our experiments. Here we learn 300-dimensional embeddings with the fixed context size $k$ (in Equation 2) and number of negative samples $K$ (in Equation 3) as used by the baseline. During training, we randomly initialize model parameters (i.e. word and topic embeddings) and then learn them using SGD with an initial learning rate of 0.01.
3.2 Word Similarity
The word similarity task evaluates the quality of word embedding models (Reisinger and Mooney, 2010). For a given dataset of word pairs, the evaluation is done by calculating the correlation between the similarity scores of the corresponding word embedding pairs and the human judgment scores. A higher Spearman's rank correlation ($\rho$) reflects a better word embedding model. We evaluate mswe on standard datasets (as given in Table 1) for the word similarity evaluation task.
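For concreteness, Spearman's rank correlation between model scores and human judgments can be computed as the Pearson correlation of the two rank vectors, with tied values sharing their average rank. The sketch below is a plain-Python illustration with hypothetical helper names; it is not the evaluation script used for the reported numbers.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values receive the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


# Toy example: cosine scores from a model vs. human ratings for three pairs.
rho = spearman([0.9, 0.1, 0.5], [9.0, 2.0, 6.5])
```

Because only the ranks matter, the scale of the model scores and of the human ratings does not need to match.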
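We compute the similarity of a pair of words $(w, w')$, with or without their respective contexts $(c, c')$, as:

$$\mathrm{GlobalSim}(w, w') = \cos(\boldsymbol{v}_{w}, \boldsymbol{v}_{w'})$$

$$\mathrm{AvgSim}(w, w') = \frac{1}{T^{2}} \sum_{t=1}^{T} \sum_{t'=1}^{T} \cos(\boldsymbol{v}_{w,t}, \boldsymbol{v}_{w',t'})$$

$$\mathrm{AvgSimC}(w, w') = \frac{1}{T^{2}} \sum_{t=1}^{T} \sum_{t'=1}^{T} \delta(\boldsymbol{v}_{w,t}, \boldsymbol{v}_{c}) \times \delta(\boldsymbol{v}_{w',t'}, \boldsymbol{v}_{c'}) \times \cos(\boldsymbol{v}_{w,t}, \boldsymbol{v}_{w',t'})$$

where $\boldsymbol{v}_{w}$ is the vector representation of the word $w$, $\boldsymbol{v}_{w,t}$ is the multiple representation of the word $w$ and the topic $t$, and $\boldsymbol{v}_{c}$ is the vector representation of the context $c$ of the word $w$. In addition, $\cos(\boldsymbol{v}, \boldsymbol{v}')$ is the cosine similarity between two vectors $\boldsymbol{v}$ and $\boldsymbol{v}'$. For our experiments, we set $\boldsymbol{v}_{w,t} = \boldsymbol{v}_{w} \oplus \boldsymbol{v}_{t}$ and compute $\boldsymbol{v}_{c}$ from the topic probabilities $\Pr(t \mid c)$, in which $\oplus$ is the concatenation operation and $\Pr(t \mid c)$ is inferred from the topic models by considering context $c$ as a document. GlobalSim only regards word embeddings, while AvgSim considers multiple representations to capture different meanings (i.e. topics) and usages of a word. AvgSimC generalizes AvgSim by taking into account the likelihood $\delta(\boldsymbol{v}_{w,t}, \boldsymbol{v}_{c})$ that word $w$ takes topic $t$ given context $c$. $\delta(\boldsymbol{v}_{w,t}, \boldsymbol{v}_{c})$ is the inverse of the cosine distance from $\boldsymbol{v}_{w,t}$ to $\boldsymbol{v}_{c}$ (Huang et al., 2012; Neelakantan et al., 2014).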
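The three kinds of similarity discussed above (word-only, averaged over topic-specific representations, and context-weighted) can be sketched as follows. This is an illustration in the spirit of the multi-prototype measures of Reisinger and Mooney (2010) and Huang et al. (2012), not the exact evaluation code. The function names are hypothetical, and the context likelihoods are passed in as precomputed lists (`like_w`, `like_w2`) because how they are derived from the context is left outside this sketch.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def global_sim(v_w, v_w2):
    """Similarity that uses only the two word vectors."""
    return cosine(v_w, v_w2)


def topic_reps(v_word, topic_vecs):
    """One representation per topic: the word vector concatenated with the topic vector."""
    return [v_word + v_t for v_t in topic_vecs]


def avg_sim(reps_w, reps_w2):
    """Average cosine over all pairs of topic-specific representations."""
    total = sum(cosine(u, v) for u in reps_w for v in reps_w2)
    return total / (len(reps_w) * len(reps_w2))


def avg_sim_c(reps_w, reps_w2, like_w, like_w2):
    """Like avg_sim, but each pair is weighted by the context likelihoods."""
    total = sum(
        like_w[t] * like_w2[t2] * cosine(reps_w[t], reps_w2[t2])
        for t in range(len(reps_w))
        for t2 in range(len(reps_w2))
    )
    return total / (len(reps_w) * len(reps_w2))


reps = [[1.0, 0.0], [0.0, 1.0]]  # toy topic-specific representations
```

Setting every likelihood to 1 makes `avg_sim_c` coincide with `avg_sim`, which is the sense in which the context-weighted measure generalizes the averaged one.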
3.2.1 Results for word similarity
Table 2 compares the evaluation results of mswe with results reported in prior work on the standard word similarity task when using GlobalSim. We use subscripts 50 and 200 to denote the topic model trained with $T = 50$ and $T = 200$ topics, respectively. Table 2 shows that our model outperforms the baseline Word2Vec Skip-gram model (in the fifth row from the bottom). Specifically, on the rw dataset, mswe obtains a significant improvement in Spearman's rank correlation (about 8.5% relative improvement).
Compared to the published results, mswe obtains the highest scores on the rw, scws, ws353 and men datasets, and achieves the second highest result on the SimLex dataset. These results indicate that mswe learns better representations for words by taking into account their different meanings.
3.2.2 Results for contextual word similarity
We evaluate our model mswe using AvgSim and AvgSimC on the benchmark scws dataset, which considers the effects of contextual information on the word similarity task. As shown in Table 3, mswe scores better than the closely related model proposed by Cheng et al. (2015) and generally obtains good results for this context-sensitive dataset. Although we produce better scores than Neelakantan et al. (2014) and Chen et al. (2014) when using GlobalSim, we are outperformed by them when using AvgSim and AvgSimC. Neelakantan et al. (2014) clustered the embeddings of the context words around each target word to predict its sense, and Chen et al. (2014) used pre-trained word embeddings to initialize vector representations of senses taken from WordNet, while we use a fixed number of topics as senses for words in mswe.
3.3 Word Analogy
We evaluate the embedding models on the word analogy task introduced by Mikolov et al. (2013a). The task aims to answer questions in the form of "$a$ is to $b$ as $c$ is to _", denoted as "$a$ : $b$ :: $c$ : ?" (e.g., "Hanoi : Vietnam :: Bern : ?"). There are 8,869 semantic and 10,675 syntactic questions grouped into 14 categories. Each question is answered by finding the word whose vector is closest to "$\boldsymbol{v}_{b} - \boldsymbol{v}_{a} + \boldsymbol{v}_{c}$" as measured by the cosine similarity. The answer is correct only if the closest word found is exactly the same as the gold-standard (correct) one for the question.
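The analogy procedure can be sketched with toy vectors as below. Excluding the three question words from the candidate set follows common practice for this benchmark and is an assumption here, as are the function names and the two-dimensional toy embeddings, which are invented so that the offset between capital and country is the same for both pairs.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def answer_analogy(a, b, c, embeddings):
    """Return the word closest (by cosine) to v_b - v_a + v_c."""
    target = [
        vb - va + vc
        for va, vb, vc in zip(embeddings[a], embeddings[b], embeddings[c])
    ]
    # Assumption: the question words themselves are not valid answers.
    candidates = (w for w in embeddings if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(embeddings[w], target))


def analogy_accuracy(questions, embeddings):
    """Percentage of questions whose predicted word equals the gold word."""
    correct = sum(
        answer_analogy(a, b, c, embeddings) == gold for a, b, c, gold in questions
    )
    return 100.0 * correct / len(questions)


# Toy embeddings (illustrative values only).
embeddings = {
    "hanoi": [1.0, 0.0],
    "vietnam": [1.0, 1.0],
    "bern": [2.0, 0.0],
    "switzerland": [2.0, 1.0],
    "river": [0.0, -1.0],
}
prediction = answer_analogy("hanoi", "vietnam", "bern", embeddings)
```

Accuracy is all-or-nothing per question: a prediction that is merely close to the gold word in vector space still counts as wrong.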
We report accuracies in Table 4 and show that mswe achieves better results than the baseline Word2Vec Skip-gram. In particular, mswe reaches an accuracy of around 69.7%, which is higher than the accuracy of 68.6% obtained by Word2Vec Skip-gram.
In this paper, we described a mixture model for learning multi-sense embeddings. Our model induces mixture weights to represent a word given context based on a mixture of its sense representations. The results show that our model scores better than Word2Vec, and produces highly competitive results on the standard evaluation tasks. In future work, we will explore better methods for taking into account the contextual information. We also plan to explore different approaches to compute the mixture weights in our model. For example, if there is a large sense-annotated corpus available for training, the mixture weights could be defined based on the frequency (sense-count) distributions, instead of using the probability distributions produced by a topic model. Furthermore, it is possible to consider the weights of senses as additional model parameters to be then learned during training.
This research was funded by the German Research Foundation (DFG) as part of SFB 1102 “Information Density and Linguistic Encoding”. We would like to thank anonymous reviewers for their helpful comments.
- Baroni et al. (2014) Marco Baroni, Georgiana Dinu, and Germán Kruszewski. 2014. Don’t count, predict! a systematic comparison of context-counting vs. context-predicting semantic vectors. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pages 238–247.
- Blei et al. (2003) David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet Allocation. Journal of Machine Learning Research 3:993–1022.
- Bruni et al. (2014) Elia Bruni, Nam Khanh Tran, and Marco Baroni. 2014. Multimodal distributional semantics. Journal of Artificial Intelligence Research 49:1–47.
- Chen et al. (2015) Tao Chen, Ruifeng Xu, Yulan He, and Xuan Wang. 2015. Improving distributed representation of word sense via wordnet gloss composition and context clustering. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers). pages 15–20.
- Chen et al. (2014) Xinxiong Chen, Zhiyuan Liu, and Maosong Sun. 2014. A unified model for word sense representation and disambiguation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). pages 1025–1035.
- Cheng and Kartsaklis (2015) Jianpeng Cheng and Dimitri Kartsaklis. 2015. Syntax-aware multi-sense word embeddings for deep compositional models of meaning. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. pages 1531–1542.
- Cheng et al. (2015) Jianpeng Cheng, Zhongyuan Wang, Ji-Rong Wen, Jun Yan, and Zheng Chen. 2015. Contextual text understanding in distributional semantic space. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management. pages 133–142.
- Finkelstein et al. (2002) Lev Finkelstein, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi Wolfman, and Eytan Ruppin. 2002. Placing search in context: The concept revisited. ACM Transactions on Information Systems 20:116–131.
- Flekova and Gurevych (2016) Lucie Flekova and Iryna Gurevych. 2016. Supersense embeddings: A unified model for supersense interpretation, prediction, and utilization. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pages 2029–2041.
- Ghannay et al. (2016) Sahar Ghannay, Benoit Favre, Yannick Estève, and Nathalie Camelin. 2016. Word embedding evaluation and combination. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016).
- Hill et al. (2015) Felix Hill, Roi Reichart, and Anna Korhonen. 2015. Simlex-999: Evaluating semantic models with genuine similarity estimation. Computational Linguistics 41:665–695.
- Hoffman et al. (2010) Matthew Hoffman, Francis R. Bach, and David M. Blei. 2010. Online learning for latent dirichlet allocation. In Advances in Neural Information Processing Systems 23. pages 856–864.
- Huang et al. (2012) Eric H. Huang, Richard Socher, Christopher D. Manning, and Andrew Y. Ng. 2012. Improving word representations via global context and multiple word prototypes. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1. pages 873–882.
- Iacobacci et al. (2015) Ignacio Iacobacci, Mohammad Taher Pilehvar, and Roberto Navigli. 2015. Sensembed: Learning sense embeddings for word and relational similarity. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). pages 95–105.
- Jauhar et al. (2015) Sujay Kumar Jauhar, Chris Dyer, and Eduard Hovy. 2015. Ontologically grounded multi-sense representation learning for semantic vector space models. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. pages 683–693.
- Levy and Goldberg (2014) Omer Levy and Yoav Goldberg. 2014. Neural word embedding as implicit matrix factorization. In Advances in Neural Information Processing Systems 27. pages 2177–2185.
- Levy et al. (2015) Omer Levy, Yoav Goldberg, and Ido Dagan. 2015. Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics 3:211–225.
- Li and Jurafsky (2015) Jiwei Li and Dan Jurafsky. 2015. Do multi-sense embeddings improve natural language understanding? In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. pages 1722–1732.
- Liu et al. (2015a) Pengfei Liu, Xipeng Qiu, and Xuanjing Huang. 2015a. Learning context-sensitive word embeddings with neural tensor skip-gram model. In Proceedings of the 24th International Conference on Artificial Intelligence. pages 1284–1290.
- Liu et al. (2015b) Yang Liu, Zhiyuan Liu, Tat-Seng Chua, and Maosong Sun. 2015b. Topical word embeddings. In AAAI Conference on Artificial Intelligence. pages 2418–2424.
- Luong et al. (2013) Thang Luong, Richard Socher, and Christopher Manning. 2013. Better word representations with recursive neural networks for morphology. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning. pages 104–113.
- Ma and Hovy (2016) Xuezhe Ma and Eduard Hovy. 2016. End-to-end sequence labeling via bi-directional lstm-cnns-crf. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pages 1064–1074.
- Mikolov et al. (2013a) Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. CoRR abs/1301.3781.
- Mikolov et al. (2013b) Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. pages 3111–3119.
- Modi (2016) Ashutosh Modi. 2016. Event embeddings for semantic script modeling. In Proceedings of the Conference on Computational Natural Language Learning. pages 75–83.
- Modi and Titov (2014) Ashutosh Modi and Ivan Titov. 2014. Inducing neural models of script knowledge. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning. pages 49–57.
- Modi et al. (2017) Ashutosh Modi, Ivan Titov, Vera Demberg, Asad Sayeed, and Manfred Pinkal. 2017. Modelling semantic expectation: Using script knowledge for referent prediction. Transactions of the Association for Computational Linguistics 5:31–44.
- Neelakantan et al. (2014) Arvind Neelakantan, Jeevan Shankar, Alexandre Passos, and Andrew McCallum. 2014. Efficient non-parametric estimation of multiple embeddings per word in vector space. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). pages 1059–1069.
- Nguyen et al. (2015a) Dat Quoc Nguyen, Richard Billingsley, Lan Du, and Mark Johnson. 2015a. Improving Topic Models with Latent Feature Word Representations. Transactions of the Association for Computational Linguistics 3:299–313.
- Nguyen et al. (2017) Dat Quoc Nguyen, Mark Dras, and Mark Johnson. 2017. A Novel Neural Network Model for Joint POS Tagging and Graph-based Dependency Parsing. In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies.
- Nguyen et al. (2015b) Dat Quoc Nguyen, Kairit Sirts, and Mark Johnson. 2015b. Improving Topic Coherence with Latent Feature Word Representations in MAP Estimation for Topic Modeling. In Proceedings of the Australasian Language Technology Association Workshop 2015. pages 116–121.
- Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP 2014). pages 1532–1543.
- Pilehvar and Collier (2016) Mohammad Taher Pilehvar and Nigel Collier. 2016. De-conflated semantic representations. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. pages 1680–1690.
- Qiu et al. (2014) Siyu Qiu, Qing Cui, Jiang Bian, Bin Gao, and Tie-Yan Liu. 2014. Co-learning of word representations and morpheme representations. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers. pages 141–150.
- Rastogi et al. (2015) Pushpendre Rastogi, Benjamin Van Durme, and Raman Arora. 2015. Multiview LSA: Representation Learning via Generalized CCA. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. pages 556–566.
- Řehůřek and Sojka (2010) Radim Řehůřek and Petr Sojka. 2010. Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks. pages 45–50.
- Reisinger and Mooney (2010) Joseph Reisinger and Raymond J. Mooney. 2010. Multi-prototype vector-space models of word meaning. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics. pages 109–117.
- Rothe and Schütze (2015) Sascha Rothe and Hinrich Schütze. 2015. Autoextend: Extending word embeddings to embeddings for synsets and lexemes. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, Volume 1: Long Papers. pages 1793–1803.
- Schnabel et al. (2015) Tobias Schnabel, Igor Labutov, David Mimno, and Thorsten Joachims. 2015. Evaluation methods for unsupervised word embeddings. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. pages 298–307.
- Schwartz et al. (2015) Roy Schwartz, Roi Reichart, and Ari Rappoport. 2015. Symmetric pattern based word embeddings for improved word similarity prediction. In Proceedings of CoNLL 2015. pages 258–267.
- Shaoul and Westbury (2010) Cyrus Shaoul and Chris Westbury. 2010. The westbury lab wikipedia corpus. Edmonton, AB: University of Alberta.
- Socher et al. (2013) Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. pages 1631–1642.
- Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Proceedings of the 27th International Conference on Neural Information Processing Systems. pages 3104–3112.
- Tian et al. (2014) Fei Tian, Hanjun Dai, Jiang Bian, Bin Gao, Rui Zhang, Enhong Chen, and Tie-Yan Liu. 2014. A probabilistic model for learning multi-prototype word embeddings. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers. pages 151–160.
- Vilnis and McCallum (2015) Luke Vilnis and Andrew McCallum. 2015. Word representations via gaussian embedding. International Conference on Learning Representations (ICLR).
- Wu and Giles (2015) Zhaohui Wu and C. Lee Giles. 2015. Sense-aware semantic analysis: A multi-prototype word representation model using wikipedia. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence. pages 2188–2194.
- Zhang and Zhong (2016) Heng Zhang and Guoqiang Zhong. 2016. Improving short text classification by learning vector representations of both words and hidden topics. Knowledge-Based Systems 102:76–86.