Learning representations for words is a fundamental step in various NLP tasks. If we can accurately represent the meanings of words using some linear algebraic structure such as vectors, we can use those word-level semantic representations to compute representations for larger lexical units such as phrases, sentences, or texts [Socher et al.2012, Le and Mikolov2014]. Moreover, by using word representations as features in downstream NLP applications, significant improvements in performance have been obtained [Turian et al.2010, Bollegala et al.2015, Collobert et al.2011]. Numerous approaches for learning word representations from large text corpora have been proposed, such as counting-based methods [Turney and Pantel2010] that follow the distributional hypothesis and use the contexts of a word to represent that word, and prediction-based methods [Mikolov et al.2013b] that learn word representations by predicting the occurrences of a word in a given context [Baroni et al.2014].
Complementary to the corpus-based, data-driven approaches for learning word representations, significant manual effort has already been invested in creating semantic lexicons such as the WordNet [Miller1995]. Semantic lexicons explicitly define the meaning of words by specifying the relations that exist between words, such as synonymy, hypernymy, or meronymy. Although it is attractive to learn word representations purely from a corpus in an unsupervised fashion because it obviates the need for manual data annotation, there exist several limitations to this corpus-only approach that a semantic lexicon could help overcome. First, corpus-based approaches operate on surface-level word co-occurrences, and ignore the rich semantic relations that exist between two co-occurring words. Second, unlike in a semantic lexicon, where a word is grouped with other words of similar senses (e.g., WordNet synsets), occurrences of a word in a corpus can be ambiguous. Third, the corpus might not be sufficiently large to obtain reliable word co-occurrence counts, which is problematic when learning representations for rare words.
On the other hand, learning word representations purely from a semantic lexicon [Bollegala et al.2014] can also be problematic. First, unlike in a corpus, where we can observe numerous co-occurrences between two words in different contexts, in a semantic lexicon we often have only a limited number of entries for a particular word. Therefore, it is difficult to accurately estimate the strength of the relationship between two words using only a semantic lexicon. Second, a corpus is likely to include neologisms or novel, creative uses of existing words. Because most semantic lexicons are maintained manually on a periodical basis, such trends will not be readily reflected in the lexicon. Considering the weaknesses of methods that use either only a corpus or only a semantic lexicon, we are naturally motivated to explore hybrid approaches.
To illustrate how a semantic lexicon can potentially assist the corpus-based word representation learning process, let us consider the sentence “I like both cats and dogs”. If this was the only sentence in the corpus where the three words cat, dog, and like occur, we would learn word representations that predict cat to be equally similar to like as to dog, because there is exactly one co-occurrence for each of the three pairs generated from those three words. However, a semantic lexicon would list cat and dog as hyponyms of pet, but not of like. Therefore, by incorporating such constraints from a semantic lexicon into the word representation learning process, we can potentially overcome this problem.
We propose a method to learn word representations using both a corpus and a semantic lexicon in a joint manner. Initially, all word representations are randomly initialized as fixed, low-dimensional, real-valued vectors, which are subsequently updated to predict the co-occurrences between two words in the corpus. We use a regularized version of the global co-occurrence prediction approach proposed by Pennington et al. [Pennington et al.2014] as our objective function. We use the semantic lexicon to construct a regularizer that encourages two words that are in a particular semantic relationship in the lexicon to have similar word representations. Unlike retrofitting [Faruqui et al.2015], which fine-tunes pre-trained word representations in a post-processing step, our method learns jointly from the corpus as well as from the semantic lexicon, thereby benefitting from the knowledge in the semantic lexicon during the word representation learning stage.
In our experiments, we use seven relation types found in the WordNet, and compare the word representations learnt by the proposed method with each relation type. Specifically, we evaluate the learnt word representations on two standard tasks: semantic similarity prediction [Bollegala et al.2007], and word analogy prediction [Duc et al.2011]. On both tasks, the proposed method statistically significantly outperforms all previously proposed methods for learning word representations using a semantic lexicon and a corpus. The performance of the proposed method is stable across a wide range of vector dimensionalities. Furthermore, experiments conducted using corpora of different sizes show that the benefit of incorporating a semantic lexicon is more prominent for smaller corpora.
2 Related Work
Learning word representations using large text corpora has received renewed interest recently due to the impressive performance gains obtained in downstream NLP applications that use word representations as features [Collobert et al.2011, Turian et al.2010]. The continuous bag-of-words (CBOW) and skip-gram (SG) methods proposed by Mikolov et al. [Mikolov et al.2013a] use the local co-occurrences of a target word and other words in its context for learning word representations. Specifically, CBOW predicts a target word given its context, whereas SG predicts the context given the target word. Global vector prediction (GloVe) [Pennington et al.2014], on the other hand, first builds a word co-occurrence matrix and predicts the total co-occurrences between a target word and a context word. Unlike SG and CBOW, GloVe does not require negative training instances, and is less likely to be affected by random local co-occurrences because it operates on global counts. However, all of the above-mentioned methods are limited to using only a corpus, and research on using semantic lexicons in the word representation learning process has been limited.
Yu and Dredze [Yu and Dredze2014] proposed the relation constrained model (RCM), where they used word similarity information to improve the word representations learnt using CBOW. Specifically, RCM assigns high probabilities to words that are listed as similar in the lexicon. Although we share the same motivation as Yu and Dredze [Yu and Dredze2014] for jointly learning word representations from a corpus and a semantic lexicon, our method differs from theirs in several aspects. First, unlike in RCM, where only synonymy is considered, we use different types of semantic relations in our model. As we show later in Section 4, numerous semantic relation types besides synonymy are useful for different tasks. Second, unlike the CBOW objective used in RCM, which considers only local co-occurrences, we use global co-occurrences over the entire corpus. This approach has several benefits over the CBOW method: for example, we are not required to normalize over the entire vocabulary to compute conditional probabilities, which is computationally costly for large vocabularies. Moreover, we do not require pseudo-negative training instances. Instead, the number of co-occurrences between two words is predicted using the inner-product between the corresponding word representations. Indeed, Pennington et al. [Pennington et al.2014] show that we can learn superior word representations by predicting global co-occurrences instead of local co-occurrences.
Xu et al. [Xu et al.2014] proposed RC-NET, which uses both relational (R-NET) and categorical (C-NET) information in a knowledge base (KB) jointly with the skip-gram objective for learning word representations. They represent both words and relations in the same embedded space. Specifically, given a relational tuple $(w_i, R, w_j)$, where a semantic relation $R$ exists in the KB between two words $w_i$ and $w_j$, they enforce the constraint that the vector sum of the representations of $w_i$ and $R$ must be close to the representation of $w_j$. Similar to RCM, RC-NET is limited to using local co-occurrence counts.
In contrast to the joint learning methods discussed above, Faruqui et al. [Faruqui et al.2015] proposed retrofitting, a post-processing method that fits pre-trained word representations to a given semantic lexicon. The modular approach of retrofitting is attractive because it can be used to fit arbitrary pre-trained word representations to an arbitrary semantic lexicon, without having to retrain the word representations. Johansson and Nieto Piña [Johansson and Nieto Piña2015] proposed a method to embed a semantic network consisting of linked word senses into a continuous-vector word space. Similar to retrofitting, their method takes pre-trained word vectors and computes sense vectors over a given semantic network. However, a disadvantage of such an approach is that we cannot use the rich information in the semantic lexicon when we learn the word representations from the corpus. Moreover, incompatibilities between the corpus and the lexicon, such as differences in word senses and missing terms, must be carefully considered. We experimentally show that our joint learning approach outperforms the post-processing approach used in retrofitting.
Iacobacci et al. [Iacobacci et al.2015] used BabelNet, and consider words that are connected to a source word in BabelNet to overcome the difficulties of measuring the similarity between rare words. However, they do not distinguish between the semantic relations, and only consider words that are listed as related in BabelNet, a notion that encompasses multiple semantic relation types. Bollegala et al. [Bollegala et al.2014] proposed a method for learning word representations from a relational graph, where they represent words and relations respectively by vectors and matrices. Their method can be applied to either a manually created relational graph, or one automatically extracted from data. However, during training they use only the relational graph and do not use the corpus.
3 Learning Word Representations
Given a corpus $\mathcal{C}$ and a semantic lexicon $\mathcal{S}$, we describe a method for learning word representations for the words in the corpus. We use the boldface $\mathbf{w}_i$ to denote the word (vector) representation of the $i$-th word $w_i$, and the vocabulary (i.e., the set of all words in the corpus) is denoted by $\mathcal{V}$. The dimensionality $d$ of the vector representations is a hyperparameter of the proposed method that must be specified by the user in advance. Any semantic lexicon that specifies the semantic relations that exist between words could be used as $\mathcal{S}$, such as the WordNet [Miller1995], FrameNet [Baker et al.1998], or the Paraphrase Database [Ganitkevitch et al.2013]. In particular, we do not assume any structural properties unique to a particular semantic lexicon. In the experiments described in this paper we use the WordNet as the semantic lexicon.
Following Pennington et al. [Pennington et al.2014], we first create a co-occurrence matrix $X$ in which the words that we would like to learn representations for (target words) are arranged in the rows of $X$, whereas the words that co-occur with the target words in some contexts (context words) are arranged in the columns of $X$. The $(i,j)$-th element $X_{ij}$ of $X$ is set to the total co-occurrences of $w_i$ and $w_j$ in the corpus. Following the recommendations in prior work on word representation learning [Levy et al.2015], we set the context window to a fixed number of tokens preceding and succeeding a word in a sentence. We then extract unigrams from the co-occurrence windows as the corresponding context words. We down-weight distant (and potentially noisy) co-occurrences using the reciprocal of the distance in tokens between the two words that co-occur.
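As a concrete illustration, the matrix construction described above can be sketched as follows. This is a minimal sketch rather than the authors' implementation; the function name `build_cooccurrences` and the toy window size are our own.

```python
from collections import defaultdict

def build_cooccurrences(sentences, window=10):
    """Accumulate a sparse global co-occurrence matrix X as a dict.

    Each co-occurrence within the window is weighted by 1/d, where d is
    the token distance between the target and the context word, so that
    distant (and potentially noisy) co-occurrences are down-weighted.
    """
    X = defaultdict(float)
    for tokens in sentences:
        for i, target in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if i != j:
                    X[(target, tokens[j])] += 1.0 / abs(i - j)
    return X
```

For example, in the sentence "I like cats", the pair (I, cats) receives weight 1/2 because the two words are two tokens apart.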
A word $w_i$ is assigned two vectors $\mathbf{w}_i$ and $\tilde{\mathbf{w}}_i$, denoting whether $w_i$ is respectively the target of the prediction (corresponding to the rows of $X$), or in the context of another word (corresponding to the columns of $X$). The GloVe objective can then be written as:

$$ J = \sum_{i,j} f(X_{ij}) \left( \mathbf{w}_i^{\top}\tilde{\mathbf{w}}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2 \tag{1} $$
Here, $b_i$ and $\tilde{b}_j$ are real-valued scalar bias terms that adjust for the difference between the inner-product and the logarithm of the co-occurrence counts. The function $f$ discounts the co-occurrences between frequent words and is given by:

$$ f(t) = \begin{cases} (t / t_{\max})^{\alpha} & \text{if } t < t_{\max} \\ 1 & \text{otherwise} \end{cases} \tag{2} $$
Following [Pennington et al.2014], we set $t_{\max} = 100$ and $\alpha = 0.75$ in our experiments. The objective function defined by (1) encourages the learning of word representations that demonstrate the desirable property that the vector difference between the embeddings of two words represents the semantic relations that exist between those two words. For example, Mikolov et al. [Mikolov et al.2013c] observed that the difference between the word embeddings for the words king and man, when added to the word embedding for the word woman, yields a vector similar to that of queen.
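The discounting function follows directly from its definition; below is a minimal sketch using the GloVe default values from Pennington et al. [Pennington et al.2014]. The function name `glove_weight` is our own.

```python
def glove_weight(x, x_max=100.0, alpha=0.75):
    """GloVe-style discounting: grow sub-linearly up to x_max, then cap
    at 1 so very frequent co-occurrences do not dominate the objective."""
    return (x / x_max) ** alpha if x < x_max else 1.0
```

Note that the cap makes all co-occurrence counts above `x_max` contribute with equal weight.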
Unfortunately, the objective function given by (1) does not capture the semantic relations between $w_i$ and $w_j$ as specified in the lexicon $\mathcal{S}$. Consequently, it considers all co-occurrences equally, and is likely to encounter problems when the co-occurrences are rare. To overcome this problem, we propose a regularizer, $R$, by considering the three-way co-occurrence among words $w_i$, $w_j$, and a semantic relation $R$ that exists between the target word and one of its context words in the lexicon, as follows:

$$ R = \frac{1}{2} \sum_{i,j} R(w_i, w_j) \left\| \mathbf{w}_i - \tilde{\mathbf{w}}_j \right\|^2 \tag{3} $$

Here, $R(w_i, w_j)$ is a binary function that returns $1$ if the semantic relation $R$ exists between the words $w_i$ and $w_j$ in the lexicon, and $0$ otherwise. In general, semantic relations are asymmetric; thus, we can have $R(w_i, w_j) \neq R(w_j, w_i)$. Experimentally, we consider both symmetric relation types, such as synonymy and antonymy, as well as asymmetric relation types, such as hypernymy and meronymy. The regularizer given by (3) enforces the constraint that words that are connected by a semantic relation in the lexicon must have similar word representations.
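The regularizer sums, over lexicon-related word pairs, the squared Euclidean distance between the target vector of one word and the context vector of the other. A minimal sketch, assuming the embeddings are stored as NumPy matrices with one row per word; the function name and the toy index pairs are ours.

```python
import numpy as np

def lexical_regularizer(W, C, related_pairs):
    """R = (1/2) * sum over related (i, j) of ||w_i - c_j||^2.

    W: target embeddings, C: context embeddings (one row per word);
    related_pairs: index pairs (i, j) for which the lexicon lists a
    semantic relation between words i and j.
    """
    return 0.5 * sum(float(np.sum((W[i] - C[j]) ** 2))
                     for i, j in related_pairs)
```

The regularizer is zero exactly when every related pair already shares identical target and context vectors.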
The overall objective function is the linear combination of (1) and (3):

$$ J + \lambda R \tag{4} $$

Here, $\lambda$ is a non-negative real-valued regularization coefficient that determines the influence imparted by the semantic lexicon on the word representations learnt from the corpus. We use development data to estimate the optimal value of $\lambda$, as described later in Section 4.
The overall objective function given by (4) is non-convex w.r.t. the four variables $\mathbf{w}_i$, $\tilde{\mathbf{w}}_j$, $b_i$, and $\tilde{b}_j$. However, if we fix three of those variables, then (4) becomes convex in the remaining one variable. We use an alternating optimization approach, where we first randomly initialize all the parameters, and then cycle through the set of variables in a pre-determined order, updating one variable at a time while keeping the other variables fixed.
The derivatives of the objective function w.r.t. the variables are given as follows:

$$ \frac{\partial (J + \lambda R)}{\partial \mathbf{w}_i} = 2 \sum_{j} f(X_{ij}) \left( \mathbf{w}_i^{\top}\tilde{\mathbf{w}}_j + b_i + \tilde{b}_j - \log X_{ij} \right) \tilde{\mathbf{w}}_j + \lambda \sum_{j} R(w_i, w_j) \left( \mathbf{w}_i - \tilde{\mathbf{w}}_j \right) $$

$$ \frac{\partial (J + \lambda R)}{\partial \tilde{\mathbf{w}}_j} = 2 \sum_{i} f(X_{ij}) \left( \mathbf{w}_i^{\top}\tilde{\mathbf{w}}_j + b_i + \tilde{b}_j - \log X_{ij} \right) \mathbf{w}_i - \lambda \sum_{i} R(w_i, w_j) \left( \mathbf{w}_i - \tilde{\mathbf{w}}_j \right) $$

$$ \frac{\partial (J + \lambda R)}{\partial b_i} = 2 \sum_{j} f(X_{ij}) \left( \mathbf{w}_i^{\top}\tilde{\mathbf{w}}_j + b_i + \tilde{b}_j - \log X_{ij} \right) $$

$$ \frac{\partial (J + \lambda R)}{\partial \tilde{b}_j} = 2 \sum_{i} f(X_{ij}) \left( \mathbf{w}_i^{\top}\tilde{\mathbf{w}}_j + b_i + \tilde{b}_j - \log X_{ij} \right) $$

We use stochastic gradient descent (SGD) with the learning rate scheduled by AdaGrad [Duchi et al.2011] as the optimization method. The overall algorithm for learning word embeddings is listed in Algorithm 1.
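The per-parameter update performed by SGD with AdaGrad scheduling can be sketched as follows; the learning rate and the toy gradient values in the test are our own assumptions, not the paper's settings.

```python
import numpy as np

def adagrad_step(param, grad, hist, lr=0.05, eps=1e-8):
    """One AdaGrad-scheduled SGD step, applied in place.

    hist accumulates the squared gradients per dimension, so dimensions
    that receive large or frequent gradients get progressively smaller
    effective learning rates.
    """
    hist += grad ** 2
    param -= lr * grad / (np.sqrt(hist) + eps)
    return param, hist
```

Because the accumulated history only grows, each coordinate's step size decays over time, which is why a fixed initial learning rate suffices in practice.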
We used the ukWaC corpus (http://wacky.sslmit.unibo.it) in our experiments. It contains ca. 2 billion tokens and has been used for learning word embeddings in prior work. We initialize the word embeddings by randomly sampling each dimension from a uniform distribution. We fix the initial learning rate in AdaGrad in all experiments, and we observed that the proposed method converges to a solution within a small, fixed number of iterations.
Building the co-occurrence matrix is an essential pre-processing step for the proposed method. Because the co-occurrences between rare words will also be rare, we can first count the frequency of each word and drop words whose total frequency is less than a pre-defined threshold, to manage the memory requirements of the co-occurrence matrix. In our experiments, we dropped words that occur fewer than a pre-defined number of times in the entire corpus when building the co-occurrence matrix. For storing larger co-occurrence matrices, we can use distributed hash tables and sparse representations.
The for-loop in Line 3 of Algorithm 1 iterates over the non-zero elements in $X$. The overall time complexity of Algorithm 1 is therefore proportional to the number of non-zero elements in $X$, which is far smaller than $|\mathcal{V}|^2$ because the global co-occurrence matrix is typically highly sparse, containing only a small fraction of non-zero entries. It takes under 50 minutes to learn word representations for the full vocabulary from the ukWaC corpus on a Xeon 2.9GHz 32-core 512GB RAM machine. The source code and data for the proposed method are publicly available (https://github.com/Bollegala/jointreps).
4 Experiments and Results
We evaluate the proposed method on two standard tasks: predicting the semantic similarity between two words, and predicting proportional analogies consisting of two pairs of words. For the similarity prediction task, we use the following benchmark datasets: Rubenstein-Goodenough (RG, 65 word-pairs) [Rubenstein and Goodenough1965], Miller-Charles (MC, 30 word-pairs) [Miller and Charles1998], rare words dataset (RW, 2034 word-pairs) [Luong et al.2013], Stanford’s contextual word similarities (SCWS, 2023 word-pairs) [Huang et al.2012], and the MEN test collection (3000 word-pairs) [Bruni et al.2012]. Each word-pair in those benchmark datasets has a manually assigned similarity score, which we consider as the gold standard rating for semantic similarity.
For each word $w$, the proposed method learns a target representation $\mathbf{w}$ and a context representation $\tilde{\mathbf{w}}$. [Levy et al.2015] show that the addition of the two vectors, $\mathbf{w} + \tilde{\mathbf{w}}$, gives a better representation for the word $w$. In particular, when we measure the cosine similarity between two words using their word representations, this additive approach considers both first- and second-order similarities between the two words. [Pennington et al.2014] originally motivated this additive operation as an ensemble method. Following these prior recommendations, we add the target and context representations to create the final representation for a word. The remainder of the experiments in the paper use those word representations.
Next, we compute the cosine similarity between the two corresponding embeddings of the words. Following the standard approach for evaluating with the above-mentioned benchmarks, we measure the Spearman correlation coefficient between the gold standard ratings and the predicted similarity scores. We use the Fisher transformation to test for the statistical significance of the correlations. Table 1 shows the Spearman correlation coefficients on the five similarity benchmarks, where higher values indicate a better agreement with the human notion of semantic similarity.
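The evaluation protocol pairs cosine similarity with Spearman rank correlation. A self-contained sketch of both measures (ignoring tied ranks, which a full evaluation would need to handle); the helper names are ours.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def spearman(x, y):
    """Spearman's rho as the Pearson correlation of the rank vectors
    (valid when there are no ties, enough for a sanity check)."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean()
    ry -= ry.mean()
    return float((rx @ ry) / np.sqrt((rx @ rx) * (ry @ ry)))
```

Because Spearman correlation depends only on ranks, it rewards embeddings that order word-pairs like the human ratings, regardless of the scale of the similarity scores.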
For the word analogy prediction task we used two benchmarks: the Google word analogy dataset [Mikolov et al.2013b], and the SemEval 2012 Task 2 dataset [Jurgens et al.2012] (SemEval). The Google dataset consists of syntactic (syn) and semantic (sem) analogies. The SemEval dataset contains manually ranked word-pairs describing various semantic relation types, such as defective and agent-goal. Given a proportional analogy $a : b :: c : d$, we compute the cosine similarity between $\mathbf{b} - \mathbf{a} + \mathbf{c}$ and $\mathbf{d}$, where the boldface symbols represent the embeddings of the corresponding words. For the Google dataset, we measure the accuracy of predicting the fourth word $d$ in each proportional analogy from the entire vocabulary. We use the binomial exact test with the Clopper-Pearson confidence interval to test for the statistical significance of the reported accuracy values. For SemEval we use the official evaluation tool (https://sites.google.com/site/semeval2012task2/) to compute MaxDiff scores.
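The analogy prediction step above (rank every vocabulary word by its cosine similarity to b - a + c, excluding the three query words) can be sketched as follows; the two-dimensional toy embeddings in the test are ours.

```python
import numpy as np

def solve_analogy(a, b, c, vocab, emb):
    """For the proportional analogy a : b :: c : d, return the word d in
    vocab maximizing cos(b - a + c, d), excluding the query words."""
    query = emb[b] - emb[a] + emb[c]
    query = query / np.linalg.norm(query)
    best, best_sim = None, -np.inf
    for w in vocab:
        if w in (a, b, c):
            continue
        sim = float(query @ (emb[w] / np.linalg.norm(emb[w])))
        if sim > best_sim:
            best, best_sim = w, sim
    return best
```

Excluding the three query words is standard practice, since the query vector is often closest to one of its own constituents.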
In Table 1, we compare the word embeddings learnt by the proposed method for different semantic relation types in the WordNet. All word embeddings compared in Table 1 have the same dimensionality. We use the WordSim-353 (WS) dataset [Finkelstein et al.2002] as validation data to find the optimal value of $\lambda$ for each relation type. Specifically, we minimize (4) for different $\lambda$ values, and use the learnt word representations to measure the cosine similarity for the word-pairs in the WS dataset. We then select the value of $\lambda$ that gives the highest Spearman correlation with the human ratings on the WS dataset. This procedure is repeated separately for each semantic relation type $R$. We found sufficiently large values of $\lambda$ to perform consistently well on all relation types. The level of performance obtained if we had used only the corpus for learning word representations (without using a semantic lexicon) is shown in Table 1 as the corpus only baseline. This baseline corresponds to setting $\lambda = 0$ in (4).
From Table 1, we see that by incorporating most of the semantic relations found in the WordNet we can improve over the corpus only baseline. In particular, the improvements reported by synonymy over the corpus only baseline are statistically significant on RG, MC, SCWS, MEN, syn, and SemEval. Among the individual semantic relations, synonymy consistently performs well on all benchmarks. Among the other relations, part-holonyms and member-holonyms perform best respectively for predicting semantic similarity between rare words (RW), and for predicting semantic analogies (sem) in the Google dataset. Meronyms and holonyms are particularly effective for predicting semantic similarity between rare words. This result is important because it shows that a semantic lexicon can assist the representation learning of rare words, for which co-occurrences are scarce even in large corpora [Luong et al.2013]. The fact that the proposed method could significantly improve performance on this task empirically justifies our proposal for using a semantic lexicon in the word representation learning process. Table 1 also shows that not all relation types are equally useful for learning word representations for a particular task. For example, hypernyms and hyponyms report lower scores compared to the corpus only baseline on predicting semantic similarity for rare (RW) and ambiguous (SCWS) word-pairs.
In Table 2, we compare the proposed method against previously proposed word representation learning methods that use a semantic lexicon: RCM is the relation constrained model proposed by Yu and Dredze [Yu and Dredze2014]; R-NET, C-NET, and RC-NET are proposed by Xu et al. [Xu et al.2014], and respectively use relational information, categorical information, and their union from the WordNet for learning word representations; and Retro is the retrofitting method proposed by Faruqui et al. [Faruqui et al.2015]. Details of those methods are described in Section 2. For Retro, we use the publicly available implementation (https://github.com/mfaruqui/retrofitting) by the original authors, and use word representations pre-trained on the same ukWaC corpus as used by the proposed method. Specifically, we retrofit word vectors produced by CBOW (Retro (CBOW)) and skip-gram (Retro (SG)). Moreover, we retrofit the word vectors learnt by the corpus only baseline (Retro (corpus only)) to compare the proposed joint learning approach against the post-processing approach of retrofitting. Unfortunately, for RCM, R-NET, C-NET, and RC-NET, neither their implementations nor trained word vectors were publicly available. Consequently, we report the published results for those methods. In cases where the result on a particular benchmark dataset is not reported in the original publication, we have indicated this by a dash in Table 2.
Among the different semantic relation types compared in Table 1, we use the synonym relation, which reports the best performance for the proposed method, in the comparison in Table 2. All word embeddings compared in Table 2 have the same dimensionality and use the WordNet as the semantic lexicon. From Table 2, we see that the proposed method reports the best scores on all benchmarks. Except on the smaller (only 65 word-pairs) RG dataset, where the performance of retrofitting is similar to that of the proposed method, on all other benchmarks the proposed method statistically significantly outperforms prior work that uses a semantic lexicon for word representation learning.
We evaluate the effect of the dimensionality $d$ on the word representations learnt by the proposed method. Owing to the limited availability of space, in Figure 1 we report results for the semantic similarity benchmarks when the synonymy relation is used in the proposed method. Similar trends were observed for the other relation types and benchmarks. From Figure 1 we see that the performance of the proposed method is relatively stable across a wide range of dimensionalities. In particular, even with a small number of dimensions we can obtain a level of performance that outperforms the corpus only baseline. On the RG, MC, and MEN datasets we initially see a gradual increase in performance with the dimensionality of the word representations. However, this improvement saturates beyond a certain dimensionality, which indicates that moderate-dimensional word representations are sufficient in most cases. More importantly, adding new dimensions does not result in any decrease in performance.
To evaluate the effect of the corpus size on the performance of the proposed method, we select a random subset of the sentences in the ukWaC corpus, which we call the small corpus, as opposed to the original large corpus. In Figure 2, we compare three settings: corpus (corresponding to the baseline method that learns using only the corpus, without the semantic lexicon), synonyms (the proposed method with the synonymy relation), and part-holonyms (the proposed method with the part-holonym relation). Figure 2 shows the Spearman correlation coefficient on the MEN dataset for the semantic similarity prediction task. We see that in both the small and large corpus settings we can improve upon the corpus only baseline by incorporating semantic relations from the WordNet. In particular, the improvement over the corpus only baseline is more prominent for the smaller corpus than for the larger one. Similar trends were observed for the other relation types as well. This shows that when the size of the corpus is small, word representation learning methods can indeed benefit from a semantic lexicon.
We proposed a method for using the information available in a semantic lexicon to improve the word representations learnt from a corpus. For this purpose, we proposed a global word co-occurrence prediction method that uses the semantic relations in the lexicon as a regularizer. Experiments using ukWaC as the corpus and WordNet as the semantic lexicon show that we can significantly improve the word representations learnt using only the corpus by incorporating the information from the semantic lexicon. Moreover, the proposed method significantly outperforms previously proposed methods for learning word representations using both a corpus and a semantic lexicon on both a semantic similarity prediction task and a word analogy detection task. The effectiveness of the semantic lexicon is prominent when the corpus size is small. Moreover, the performance of the proposed method is stable over a wide range of dimensionalities of word representations. In future work, we plan to apply the word representations learnt by the proposed method in downstream NLP applications to conduct extrinsic evaluations.
- [Baker et al.1998] Collin F. Baker, Charles J. Fillmore, and John B. Lowe. The berkeley framenet project. In Proc. of ACL-COLING, pages 86–90, August 1998.
- [Baroni et al.2014] Marco Baroni, Georgiana Dinu, and Germán Kruszewski. Don’t count, predict! a systematic comparison of context-counting vs. context-predicting semantic vectors. In Proc. of ACL, pages 238–247, 2014.
- [Bollegala et al.2007] Danushka Bollegala, Yutaka Matsuo, and Mitsuru Ishizuka. An integrated approach to measuring semantic similarity between words using information available on the web. In Proc. of NAACL-HLT, pages 340–347, 2007.
- [Bollegala et al.2014] Danushka Bollegala, Takanori Maehara, Yuichi Yoshida, and Ken-ichi Kawarabayashi. Learning word representations from relational graphs. In Proc. of AAAI, pages 2146 – 2152, 2014.
- [Bollegala et al.2015] Danushka Bollegala, Takanori Maehara, and Ken-ichi Kawarabayashi. Unsupervised cross-domain word representation learning. In Proc. of ACL, pages 730 – 740, 2015.
- [Bruni et al.2012] Elia Bruni, Gemma Boleda, Marco Baroni, and Nam Khanh Tran. Distributional semantics in technicolor. In Proc. of ACL, pages 136–145, 2012.
- [Collobert et al.2011] Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12:2493 – 2537, 2011.
- [Duc et al.2011] Nguyen Tuan Duc, Danushka Bollegala, and Mitsuru Ishizuka. Cross-language latent relational search: Mapping knowledge across languages. In Proc. of AAAI, pages 1237 – 1242, 2011.
- [Duchi et al.2011] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121 – 2159, 2011.
- [Faruqui et al.2015] Manaal Faruqui, Jesse Dodge, Sujay Kumar Jauhar, Chris Dyer, Eduard Hovy, and Noah A. Smith. Retrofitting word vectors to semantic lexicons. In Proc. of NAACL-HLT, pages 1606–1615, 2015.
- [Finkelstein et al.2002] L. Finkelstein, E. Gabrilovich, Y. Matias, E. Rivlin, z. Solan, G. Wolfman, and E. Ruppin. Placing search in context: The concept revisited. ACM Transactions on Information Systems, 20:116–131, 2002.
- [Ganitkevitch et al.2013] Juri Ganitkevitch, Benjamin Van Durme, and Chris Callison-Burch. Ppdb: The paraphrase database. In Proc. of NAACL-HLT, pages 758–764, June 2013.
- [Huang et al.2012] Eric H. Huang, Richard Socher, Christopher D. Manning, and Andrew Y. Ng. Improving word representations via global context and multiple word prototypes. In Proc. of ACL, pages 873 – 882, 2012.
- [Iacobacci et al.2015] Ignacio Iacobacci, Mohammad Taher Pilehvar, and Roberto Navigli. Sensembed: Learning sense embeddings for word and relational similarity. In Proc. of ACL, pages 95–105, 2015.
- [Johansson and Nieto Piña2015] Richard Johansson and Luis Nieto Piña. Embedding a semantic network in a word space. In Proc. of NAACL-HLT, pages 1428–1433, 2015.
- [Jurgens et al.2012] David A. Jurgens, Saif Mohammad, Peter D. Turney, and Keith J. Holyoak. Measuring degrees of relational similarity. In SemEval’12, 2012.
- [Le and Mikolov2014] Quoc Le and Tomas Mikolov. Distributed representations of sentences and documents. In ICML, pages 1188 – 1196, 2014.
- [Levy et al.2015] Omer Levy, Yoav Goldberg, and Ido Dagan. Improving distributional similarity with lessons learned from word embeddings. Transactions of Association for Computational Linguistics, 2015.
- [Luong et al.2013] Minh-Thang Luong, Richard Socher, and Christopher D. Manning. Better word representations with recursive neural networks for morphology. In CoNLL, 2013.
- [Mikolov et al.2013a] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781, 2013.
- [Mikolov et al.2013b] Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. Distributed representations of words and phrases and their compositionality. In Proc. of NIPS, pages 3111 – 3119, 2013.
- [Mikolov et al.2013c] Tomas Mikolov, Wen tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space word representations. In Proc. of NAACL, pages 746 – 751, 2013.
- [Miller and Charles1998] G. Miller and W. Charles. Contextual correlates of semantic similarity. Language and Cognitive Processes, 6(1):1–28, 1998.
- [Miller1995] George A. Miller. Wordnet: a lexical database for English. Communications of the ACM, 38(11):39 – 41, 1995.
- [Pennington et al.2014] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. Glove: global vectors for word representation. In Proc. of EMNLP, pages 1532 – 1543, 2014.
- [Rubenstein and Goodenough1965] H. Rubenstein and J.B. Goodenough. Contextual correlates of synonymy. Communications of the ACM, 8:627–633, 1965.
- [Socher et al.2012] Richard Socher, Brody Huval, Christopher D. Manning, and Andrew Y. Ng. Semantic compositionality through recursive matrix-vector spaces. In Proc. of EMNLP, pages 1201–1211, 2012.
- [Turian et al.2010] Joseph Turian, Lev Ratinov, and Yoshua Bengio. Word representations: A simple and general method for semi-supervised learning. In ACL, pages 384 – 394, 2010.
- [Turney and Pantel2010] Peter D. Turney and Patrick Pantel. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37:141 – 188, 2010.
- [Xu et al.2014] Chang Xu, Yalong Bai, Jiang Bian, Bin Gao, Gang Wang, Xiaoguang Liu, and Tie-Yan Liu. Rc-net: A general framework for incorporating knowledge into word representations. In Proc. of CIKM, pages 1219–1228, 2014.
- [Yu and Dredze2014] Mo Yu and Mark Dredze. Improving lexical embeddings with semantic knowledge. In Proc. of ACL, pages 545 – 550, 2014.