have been hugely successful in generating useful vector representations for words that preserve their distributional properties in a given corpus. Improving the quality of word embeddings has led to better performance in many downstream language tasks. Given the widespread use of word embeddings, there has been a lot of interest in improving their quality by leveraging lexical knowledge such as synonymy, hypernymy, hyponymy, troponymy and paraphrase relations. This is supported by the availability of large-scale lexical knowledge in WordNet Miller (1995) and the Paraphrase Database (PPDB) Ganitkevitch et al. (2013).
In this paper, we propose two simple yet powerful approaches to incorporate lexical knowledge into word embeddings. First, we propose a matrix factorization based approach which uses the idea of ‘sprinkling’ Chakraborti et al. (2006, 2007) semantic knowledge into the word co-occurrence matrix. Second, we identify the weaknesses of the retrofitting model Faruqui et al. (2014) and propose a few modifications that improve its performance. We demonstrate the strength of the proposed models by showing significant improvements in two commonly used intrinsic language tasks, word similarity and analogy, and two extrinsic tasks, named entity recognition (NER) and part-of-speech (POS) tagging.
2 Related Works
Learning word embeddings that capture distributional information has been vital to many NLP tasks. Prediction-based methods such as skip-gram Mikolov et al. (2013a) and CBOW Bengio et al. (2003) use neural language modelling to predict a word from its context words (or vice versa) and extract the learned weight vectors as word embeddings. On the other hand, count-based methods derive a co-occurrence matrix of words in the corpus and use matrix factorization techniques like SVD to extract word representations Levy and Goldberg (2014). GloVe Pennington et al. (2014) uses the co-occurrence matrix to train word embeddings such that the dot product between any two word vectors is proportional to the log probability of their co-occurrence.
The models that incorporate lexical knowledge into word embeddings can be broadly classified into two categories, namely post-processing and joint learning. Post-processing methods such as Faruqui et al. (2014); Mrkšić et al. (2016) take pre-trained word embeddings and modify them by injecting semantic knowledge. The retrofitting method Faruqui et al. (2014) derives similarity constraints from WordNet and other resources to pull similar words closer together. The counter-fitting approach Mrkšić et al. (2016) additionally pushes antonymous words away from each other. These approaches consider only one-hop neighbour relations. We improve upon this by considering multi-hop neighbours and by using structural and information-based similarity scores to determine their relative importance when imposing similarity constraints on the word embeddings.
Joint learning approaches like Yu and Dredze (2014); Fried and Duh (2014); Vashishth et al. (2018) learn word embeddings by jointly optimizing distributional and relational information. For instance, in Yu and Dredze (2014), the objective function combines the original skip-gram objective with prior knowledge from semantic resources to learn improved lexical semantic embeddings. The recent work by Vashishth et al. (2018) uses Graph Convolutional Networks (GCNs) to learn relations between words and outperforms the previous methods in many language tasks.
Latent Semantic Indexing (LSI), also known as Latent Semantic Analysis (LSA), learns a distributional representation for words by performing Singular Value Decomposition (SVD) on the term-document matrix. However, the dimensions obtained from LSI are not optimal in a classification setting because LSI is agnostic to the class label information of the training data. The sprinkling method introduced by Chakraborti et al. (2006) improves LSI by appending the class labels as extra features (terms) to the corresponding training documents. When LSI is carried out on this augmented term-document matrix, terms pertaining to the same class are pulled closer to each other. An extension of this method, called adaptive sprinkling Chakraborti et al. (2007), makes it possible to control the importance of specific class labels by appending them multiple times to the term-document matrix. For instance, in double sprinkling, the class labels are appended twice to the matrix, thus strengthening the weakly supervised constraints imposed by class labels.
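The sprinkling idea described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the implementation of Chakraborti et al.: the function names `sprinkle` and `lsi_term_vectors`, the toy term-document counts and the choice to scale term vectors by the singular values are all hypothetical.

```python
import numpy as np

def sprinkle(term_doc, labels, n_classes, copies=1):
    """Append one indicator row per class to a (terms x docs) matrix.

    copies > 1 repeats the label rows, in the spirit of adaptive sprinkling.
    """
    term_doc = np.asarray(term_doc, dtype=float)
    n_docs = term_doc.shape[1]
    label_rows = np.zeros((n_classes, n_docs))
    # label_rows[c, d] = 1 when document d carries class label c
    label_rows[labels, np.arange(n_docs)] = 1.0
    return np.vstack([term_doc] + [label_rows] * copies)

def lsi_term_vectors(matrix, rank, n_terms):
    """Truncated SVD; keep vectors of the original terms only (label rows dropped)."""
    U, S, _ = np.linalg.svd(matrix, full_matrices=False)
    return U[:n_terms, :rank] * S[:rank]

# Toy data (made up): 3 terms x 4 documents, two classes.
term_doc = [[2, 1, 0, 0],
            [0, 1, 1, 0],
            [0, 0, 1, 2]]
labels = [0, 0, 1, 1]

augmented = sprinkle(term_doc, labels, n_classes=2)          # single sprinkling
doubly = sprinkle(term_doc, labels, n_classes=2, copies=2)   # double sprinkling
term_vectors = lsi_term_vectors(augmented, rank=2, n_terms=3)
```

Repeating the label rows increases their share of the matrix norm, which is why double sprinkling imposes the class constraint more strongly than single sprinkling.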
3 Proposed Models
In this section, we discuss the proposed models to incorporate semantic knowledge into word embeddings.
3.1 SS-PPMI & DSS-PPMI
In this approach, we take advantage of Levy and Goldberg’s work Levy and Goldberg (2014) in which the authors have shown that the objective function used in Word2vec Mikolov et al. (2013a) implicitly factorizes a Shifted PPMI (SPPMI) matrix. While there are many methods that attempt to inject semantic knowledge into neural word embeddings, to the best of our knowledge, we have not come across any work that tries to inject semantic knowledge into the SPPMI matrix. In its original form, the SPPMI matrix captures only distributional information. Hence, we are interested in analysing the impact of injecting semantic knowledge into the SPPMI matrix and the effectiveness of the resulting word embeddings.
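For readers unfamiliar with the SPPMI matrix, the following is a minimal sketch of how it can be computed from a word-context co-occurrence count matrix, following the definition in Levy and Goldberg (2014): PMI shifted by the log of the number of negative samples and clipped at zero. The function name `sppmi` and the toy counts are hypothetical, and a dense matrix is assumed only for brevity.

```python
import numpy as np

def sppmi(cooc, neg=5):
    """Shifted positive PMI: max(PMI(w, c) - log(neg), 0) for a (words x contexts) count matrix."""
    cooc = np.asarray(cooc, dtype=float)
    total = cooc.sum()
    row = cooc.sum(axis=1, keepdims=True)   # word marginals
    col = cooc.sum(axis=0, keepdims=True)   # context marginals
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(cooc * total / (row * col))
    pmi[~np.isfinite(pmi)] = -np.inf        # zero counts end up clipped to 0 below
    return np.maximum(pmi - np.log(neg), 0.0)

# Toy counts (made up): two words that only co-occur with themselves.
toy_counts = [[10, 0],
              [0, 10]]
unshifted = sppmi(toy_counts, neg=1)   # neg=1 gives plain positive PMI
shifted = sppmi(toy_counts, neg=5)
```

A real vocabulary would call for a sparse representation; the dense form here is only meant to make the formula concrete.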
Inspired by Chakraborti et al. (2006, 2007), who exploit the class knowledge of documents by ‘sprinkling’ label terms into the term-document matrix before matrix factorization, we modify the SPPMI matrix by adding reachability information from lexical knowledge bases such as WordNet and PPDB. In the lexical graphs obtained from these knowledge bases, words are connected by edges representing relations such as synonymy, hypernymy, etc. We say that a word is reachable from another word if and only if there exists a path between them in the lexical graph. More formally, let $N$ be the size of the vocabulary. We define the reachability matrix $R^{k} \in \{0, 1\}^{N \times N}$ to be a zero-one square matrix with each element $R^{k}_{ij}$ indicating whether word $j$ is reachable from word $i$ within $k$ hops in the lexical knowledge graph.
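The k-hop reachability matrix defined above can be sketched as follows. This is an illustrative construction rather than the authors' code: the function name `reachability_matrix` is hypothetical, the graph is given as a dense adjacency matrix, and setting the diagonal to zero (a word is not counted as reachable from itself) is an assumption, since the text does not specify it.

```python
import numpy as np

def reachability_matrix(adj, k):
    """reach[i, j] = 1 if word j can be reached from word i in at most k hops."""
    adj = (np.asarray(adj) > 0).astype(np.int64)
    frontier = adj.copy()   # pairs joined by a walk of exactly 1 hop
    reach = adj.copy()
    for _ in range(k - 1):
        # extend every walk by one more hop, then binarise
        frontier = (frontier @ adj > 0).astype(np.int64)
        reach = np.maximum(reach, frontier)
    np.fill_diagonal(reach, 0)  # assumption: no self-reachability
    return reach

# Toy lexical graph (made up): a chain of four words 0 - 1 - 2 - 3.
chain = [[0, 1, 0, 0],
         [1, 0, 1, 0],
         [0, 1, 0, 1],
         [0, 0, 1, 0]]
one_hop = reachability_matrix(chain, 1)
two_hop = reachability_matrix(chain, 2)
three_hop = reachability_matrix(chain, 3)
```

For lexical graphs with tens of thousands of nodes, a breadth-first search per node over a sparse adjacency list would be the more practical choice.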
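We concatenate the reachability matrix with the SPPMI matrix to obtain Sprinkled Shifted - Positive PMI (SS-PPMI). We then perform SVD on this augmented matrix to obtain the enriched word embeddings:

$$\text{SPPMI} = \max(\text{PMI} - \log s,\ 0)$$
$$\text{SS-PPMI} = \text{SPPMI} \circ R^{k}$$
$$\text{SS-PPMI}_{d} = U_{d}\, \Sigma_{d}\, V_{d}^{\top}, \qquad W = U_{d}\, \Sigma_{d}^{p}$$

where $\circ$ denotes the matrix concatenation operation, $R^{k}$ is the $k$-hop reachability matrix, $s$ denotes the number of negative samples and $\text{SS-PPMI}_{d}$ denotes the rank-$d$ approximation of the SS-PPMI matrix. For a vocabulary of size $N$, the SS-PPMI matrix is of dimensions $N \times 2N$. Following Levy and Goldberg (2014), we set $p$ to 0.5 to obtain the word embeddings $W$.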
The original motivation for the sprinkling technique Chakraborti et al. (2006) was that documents of the same class are brought closer by appending the class labels to the term-document matrix. Likewise, words which have strong lexical relations such as synonymy or antonymy have similar neighbourhoods in graphs like WordNet. This translates to these word pairs having similar columns in the reachability matrix. Thus, appending the reachability matrix to the SPPMI matrix brings such words closer.
We can further strengthen these constraints by adding the reachability matrix multiple times, as done in adaptive sprinkling Chakraborti et al. (2007). We performed experiments adding the reachability matrix twice, and we call the resulting matrix Doubly Sprinkled Shifted - Positive PMI (DSS-PPMI), which is of dimensions $N \times 3N$ for a vocabulary of size $N$.
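The construction of SS-PPMI and DSS-PPMI embeddings can be sketched as below. This is a hedged illustration: the function name `sprinkled_embeddings` is hypothetical, the random toy matrices stand in for real SPPMI and reachability matrices, and placing the reachability blocks to the right of the SPPMI matrix (so that each word stays one row) is an assumption about the concatenation direction. The exponent 0.5 on the singular values follows the setting stated in the text.

```python
import numpy as np

def sprinkled_embeddings(sppmi, reach, dim, copies=1, p=0.5):
    """Concatenate `copies` reachability blocks next to SPPMI, then return U_d * Sigma_d^p."""
    augmented = np.hstack([sppmi] + [reach] * copies)   # N x (1 + copies) * N
    U, S, _ = np.linalg.svd(augmented, full_matrices=False)
    return augmented, U[:, :dim] * (S[:dim] ** p)

# Toy stand-ins (random, for shape checking only): N = 6 words.
rng = np.random.default_rng(0)
toy_sppmi = rng.random((6, 6))
toy_reach = (rng.random((6, 6)) > 0.5).astype(float)

ss_matrix, ss_vectors = sprinkled_embeddings(toy_sppmi, toy_reach, dim=3)             # SS-PPMI
dss_matrix, dss_vectors = sprinkled_embeddings(toy_sppmi, toy_reach, dim=3, copies=2)  # DSS-PPMI
```

For a realistic vocabulary, a sparse truncated SVD would replace the dense `np.linalg.svd` call used here.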
Retrofitting, introduced by Faruqui et al. (2014), is a method to add semantic information to pre-trained word vectors. This post-processing step modifies the word embeddings such that the embeddings of words with semantic relations between them are pulled towards each other. Formally, given the pre-trained vectors $\hat{Q} = (\hat{q}_1, \ldots, \hat{q}_N)$ and a knowledge base represented by the adjacency matrix $A$, we need to learn new vectors $Q = (q_1, \ldots, q_N)$ such that the following objective is minimized:

$$\Psi(Q) = \sum_{i=1}^{N} \Big[ \alpha_i \lVert q_i - \hat{q}_i \rVert^2 + \sum_{j=1}^{N} A_{ij}\, \beta_{ij} \lVert q_i - q_j \rVert^2 \Big]$$

The objective is a convex function and we can find the solution using the efficient iterative update method used in Faruqui et al. (2014):

$$q_i = \frac{\sum_{j} A_{ij}\, \beta_{ij}\, q_j + \alpha_i\, \hat{q}_i}{\sum_{j} A_{ij}\, \beta_{ij} + \alpha_i}$$

The term $\beta_{ij}$ is usually assigned as $\text{degree}(i)^{-1}$. This choice of weights can be improved by learning them from semantic knowledge sources such as WordNet.
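The iterative retrofitting update can be sketched as follows. This is a minimal sketch in the spirit of Faruqui et al. (2014), not their released code: each vector is replaced by a weighted average of its original pre-trained vector and the current vectors of its neighbours. The function name `retrofit`, the neighbour-dictionary input format, the fixed iteration count and the toy vectors are assumptions; the default edge weight of one over the node degree follows the usual setting mentioned in the text.

```python
import numpy as np

def retrofit(q_hat, neighbors, beta=None, alpha=1.0, iters=10):
    """q_hat: (N x d) pre-trained vectors; neighbors: dict node -> list of neighbour nodes.

    beta: optional dict of dicts with edge weights; defaults to 1 / degree(i).
    """
    q_hat = np.asarray(q_hat, dtype=float)
    q = q_hat.copy()
    for _ in range(iters):
        for i, nbrs in neighbors.items():
            if not nbrs:
                continue  # words without neighbours keep their original vector
            weights = [beta[i][j] if beta is not None else 1.0 / len(nbrs) for j in nbrs]
            numerator = alpha * q_hat[i] + sum(w * q[j] for w, j in zip(weights, nbrs))
            q[i] = numerator / (alpha + sum(weights))  # in-place update, reused by later nodes
    return q

# Toy example (made up): two related words that start 2 units apart.
toy_vectors = [[0.0, 0.0], [2.0, 0.0]]
toy_neighbors = {0: [1], 1: [0]}
after_one = retrofit(toy_vectors, toy_neighbors, iters=1)
after_ten = retrofit(toy_vectors, toy_neighbors, iters=10)
```

Passing a `beta` dictionary lets the same loop run with similarity-based weights instead of the uniform degree-based ones.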
We propose a modification to the retrofitting method called W-Retrofitting (weighted retrofitting), where we use WordNet-based similarity scores to obtain a better setting of the edge weights $\beta_{ij}$. For two words $i$ and $j$ with WordNet similarity score $\text{sim}(i, j)$, $\beta_{ij}$ is obtained by normalizing the similarity scores across the neighbours of $i$ and is given as $\beta_{ij} = \text{sim}(i, j) / \sum_{l \in \mathcal{N}(i)} \text{sim}(i, l)$, where $\mathcal{N}(i)$ is the set of neighbours of word $i$. Since a word can have multiple synsets, the similarity score is the maximum of the similarity scores over all possible pairs of synsets, taking one from each of the two words. For information-based similarity measures like Lin similarity, we compute mutual information from a random subset of the Wikipedia corpus containing 100,000 articles. Further, we extend our method to consider nodes which are at most 2 hops from a given node when computing weights.
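The weighting scheme described above can be sketched as below: a word-level similarity taken as the maximum over synset pairs, followed by normalisation over each word's neighbours. The function names, the synset identifiers and all similarity scores are made up for illustration; in practice the synset-level scores would come from a WordNet similarity measure.

```python
def word_similarity(synsets_a, synsets_b, synset_sim):
    """Maximum similarity over all synset pairs, taking one synset from each word."""
    return max(synset_sim(a, b) for a in synsets_a for b in synsets_b)

def normalized_weights(neighbors, sim):
    """weights[i][j] = sim(i, j) divided by the sum of sim(i, l) over neighbours l of i."""
    weights = {}
    for i, nbrs in neighbors.items():
        scores = {j: sim(i, j) for j in nbrs}
        total = sum(scores.values())
        weights[i] = {j: (s / total if total > 0 else 0.0) for j, s in scores.items()}
    return weights

# Toy synset-level scores (made up).
toy_synset_scores = {("a1", "b1"): 0.2, ("a2", "b1"): 0.7}
best = word_similarity(["a1", "a2"], ["b1"],
                       lambda x, y: toy_synset_scores.get((x, y), 0.0))

# Toy word-level scores (made up) for one word with two neighbours.
toy_word_scores = {("car", "auto"): 1.0, ("car", "vehicle"): 0.5}
weights = normalized_weights({"car": ["auto", "vehicle"]},
                             lambda x, y: toy_word_scores[(x, y)])
```

With this normalisation the weights of each word's neighbours sum to one, so a closer neighbour pulls proportionally harder than a more distant one.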
|Similarity||RG65, WS353S, SimLex999|
|No Distinguishing||MEN, RW, MTurk, WS353|
4 Experimental Setup
4.1 Intrinsic Evaluation
We evaluate the proposed models on word similarity and analogy tasks.
Word similarity: We use MEN Bruni et al. (2014), MTurk Radinsky et al. (2011), RG65 Rubenstein and Goodenough (1965), Rare Words (RW) Luong et al. (2013), SimLex999 Hill et al. (2015), TR9856 Levy et al. (2015b), WS353 Finkelstein et al. (2002), WS353S (Similarity) and WS353R (Relatedness). Spearman correlation is used as the evaluation metric.
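The word similarity evaluation can be sketched in plain Python as follows. Spearman correlation is computed as the Pearson correlation of average ranks. Scoring a word pair by the cosine similarity of its two embeddings and skipping pairs with out-of-vocabulary words are common conventions assumed here, not details stated in the text; the function names and the toy data are hypothetical.

```python
import math

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2.0 + 1.0
        for idx in order[pos:end + 1]:
            ranks[idx] = avg
        pos = end + 1
    return ranks

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def evaluate_similarity(embeddings, pairs):
    """pairs: iterable of (word1, word2, human_score); returns Spearman correlation."""
    gold, predicted = [], []
    for w1, w2, score in pairs:
        if w1 in embeddings and w2 in embeddings:   # assumption: skip OOV pairs
            gold.append(score)
            predicted.append(cosine(embeddings[w1], embeddings[w2]))
    return pearson(average_ranks(gold), average_ranks(predicted))

# Toy embeddings and human scores (made up).
toy_embeddings = {"a": [1.0, 0.0], "b": [1.0, 0.1], "c": [0.0, 1.0]}
toy_pairs = [("a", "b", 9.0), ("a", "c", 1.0), ("b", "c", 2.0)]
rho = evaluate_similarity(toy_embeddings, toy_pairs)
```

In the toy example the model ranks the three pairs in the same order as the human scores, so the correlation is perfect.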
4.2 Sources of Knowledge
We used two sources of semantic knowledge: WordNet Miller (1995) and PPDB Ganitkevitch et al. (2013). We used the same PPDB knowledge source as Faruqui et al. (2014), and the WordNet source from V. Batagelj (2004). The relations considered are synonymy, hypernymy, meronymy and verb entailment. PPDB has 84467 nodes and 169703 edges; the WordNet source we used has 82313 nodes and 98678 edges.
We used the latest Wikipedia dump (https://dumps.wikimedia.org/enwiki/latest/) containing 6 Billion wikipedia articles to generate the SPPMI matrix. We followed the same procedure as given in Levy et al. (2015a) and set the number of negative samples to the default value of 5. In all of our experiments, we chose an embedding dimension of 300, which is commonly used in the literature.
We use the following baselines for comparison:
Retrofit: We apply the retrofitting technique Faruqui et al. (2014) on the GloVe embeddings, with WordNet or PPDB used as the source of word relations.
SPPMI: We perform SVD on the Shifted PPMI matrix (as mentioned in Section 3) without sprinkling.
SynGCN Vashishth et al. (2018): This work uses graph-convolution based methods to impart relational information between words and has shown state-of-the-art results in many benchmarks. We directly report the available results from the original paper, which uses the same evaluation benchmarks.
4.4 Extrinsic Evaluation
To further test the effectiveness of the different methods in grounding word meanings, we utilize the embeddings in the following tasks. The neural network architectures used for each of the tasks are the same as those used in Vashishth et al. (2018).
5 Results and Analysis
Reachability Matrix is powerful in capturing semantic information: We proposed a simple sprinkling approach in which a zero-one matrix captures the $k$-hop reachability information between words in a lexical knowledge graph. In order to see how effectively the reachability matrix captures the lexical knowledge, we performed SVD on the reachability matrix alone and obtained word embeddings from it. Table 2 shows the performance of these embeddings on the word similarity task; the embedding dimension is 300. Interestingly, we observe that the embeddings obtained from the reachability matrix only (without the SPPMI matrix) compete strongly with 300-dimensional pre-trained GloVe embeddings on the similarity-based datasets. The best performing model gives a Spearman correlation which is 0.19 higher than GloVe on SimLex999. Similarly, on RG65 and WS353S, the reachability-based embeddings compete well with GloVe. Between PPDB and WordNet as lexical knowledge sources, PPDB seems to be more helpful. In general, the performance of reachability-based embeddings on the similarity datasets increases with the number of hops.
In the case of relatedness datasets, the model performs poorly compared to the GloVe baseline. This is expected, as the reachability matrix doesn’t capture any information about word co-occurrence. These observations are foundational to our proposed SS-PPMI and DSS-PPMI methods.
|Lexical Knowledge||Hops - k||SimLex999||WS353S||RG65||WS353R||TR9856||WS353||MEN||MTurk||RW|
|Baseline - GloVe||-||0.370||0.665||0.769||0.560||0.575||0.601||0.737||0.633||0.411|
SS-PPMI and DSS-PPMI provide significant improvements in word similarity and analogy: Table 3 provides the results of the SS-PPMI and DSS-PPMI approaches on the word similarity task with an embedding dimension of 300. We observe that the proposed models outperform the baseline on all the datasets. The margin of improvement is quite high on the similarity datasets: we see an increase of close to 0.21 in Spearman correlation for SimLex999 and of 0.04 for RG65. This is somewhat expected, as we already saw that the reachability matrix contains lexical information. Interestingly, we also see improvements on the relatedness datasets, where the sprinkling approaches perform narrowly better than the SPPMI-based approach. On other datasets like WS353 and MEN, we see improvements of about 0.02 and 0.03 in Spearman correlation, respectively. Overall, sprinkling significantly improves the performance on the word similarity task.
Overall, we observe that the Double Sprinkling method (DSS-PPMI) works better than SPPMI on the word similarity task. In general, increasing the number of hops ($k$) in the reachability matrix improves the performance on word similarity.
Table 4 shows the improvements provided by the sprinkling methods on the analogy datasets. We observe marginal improvements over the baseline on Google and SemEval2012.
We apply our W-Retrofitting model to GloVe Pennington et al. (2014) embeddings trained on the Wikipedia corpus. We experimented with one-hop and two-hop neighbours and several methods for similarity estimation: inverse path similarity, Jiang-Conrath Similarity Jiang and Conrath (1997), Wu-Palmer Similarity Wu and Palmer (1994), Leacock-Chodorow Similarity Leacock and Chodorow (1998) and Lin Similarity Lin and others (1998). The neighbourhood information for estimating similarity was obtained from either the WordNet or the PPDB graph. We found that Jiang-Conrath Similarity works best for WordNet and inverse path similarity for PPDB, so we report results for these similarity measures only.
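As an illustration of a path-based score over one-hop and two-hop neighbours, the sketch below runs a breadth-first search on a toy graph and scores each reached word as 1 / (1 + d), where d is its hop distance. This formula is an assumption borrowed from the common definition of path similarity; the text does not give the exact form of inverse path similarity used in the experiments. The function names and the toy graph are hypothetical.

```python
from collections import deque

def hop_distances(graph, source, max_hops):
    """Breadth-first search returning the hop distance of every node within max_hops of source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if dist[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for nbr in graph.get(node, ()):
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    del dist[source]
    return dist

def inverse_path_similarity(graph, source, max_hops=2):
    """Assumed form: similarity = 1 / (1 + hop distance)."""
    return {w: 1.0 / (1 + d) for w, d in hop_distances(graph, source, max_hops).items()}

# Toy lexical graph (made up): a chain a - b - c - d.
toy_graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
scores = inverse_path_similarity(toy_graph, "a", max_hops=2)
```

Scores of this kind can then be normalised over each word's neighbours to serve as edge weights in the retrofitting objective.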
The performance of all our models is either comparable or superior to the baselines, as seen in Table 5. Using the PPDB knowledge source with path-based similarity as weights in the retrofitting objective gives the best performance and outperforms the baselines in most benchmarks.
Some of our models outperform the retrofitting baselines on Google analogy. On the SemEval task, we mostly outperform GloVe, but the retrofitting baseline on WordNet gives the best score. The results are summarised in Table 6.
5.3 Overall Comparison on Word Similarity
In order to make a fair and direct comparison between sprinkling and retrofitting, we applied retrofitting and W-Retrofitting (using inverse path similarity over the PPDB graph) on the 300-dimensional SPPMI vectors. Table 7 provides the best results of the models on each of the word similarity and analogy datasets. We make the following observations. W-Retrofitting does much better than retrofitting on the similarity datasets, as we also saw with GloVe embeddings. The improvement comes from two things: the inclusion of two-hop neighbour information and the intelligent choice of weights from WordNet in W-Retrofitting.
Using only the reachability matrix provides very good scores on the similarity-based datasets, but doesn’t capture relatedness information at all. Using the sprinkling approach, we obtain embeddings that effectively combine similarity and relatedness information, which makes them perform better than all the other baselines in similarity, relatedness and analogy tasks.
5.4 Evaluation on Extrinsic tasks
The results on extrinsic tasks (discussed in Section 4.4) are given in Tables 8 and 9. In the case of sprinkling methods, we see a clear increase in scores on both extrinsic tasks when using the proposed SS-PPMI matrix instead of only the SPPMI matrix. We also see that models using PPDB perform better. One reason why we do not compare the scores of sprinkling-based methods with those of GloVe and retrofitting-based ones is that the vocabulary size (number of nodes) of the PPDB and WordNet graphs is smaller than that of GloVe. Unlike GloVe, we also didn’t consider punctuation symbols in SPPMI.
In the case of W-Retrofitting, the scores of the proposed model using jcn (Jiang-Conrath) weights on the WordNet graph are very similar to those of the SynGCN model, in spite of SynGCN being a more complex model with many hyperparameters. The other W-Retrofitting variants also have performance comparable to SynGCN. We observe improved performance when considering up to 2-hop neighbours over methods considering just 1-hop neighbours. It is quite interesting to see that the proposed lightweight retrofitting model competes strongly with the more complex SynGCN method, as shown by the results in Table 9.
6 Conclusion and Future Work
In this paper, we proposed two simple yet powerful approaches to incorporate lexical knowledge into word embeddings. The first approach is a matrix factorization method that ‘sprinkles’ higher-order graph information into the word co-occurrence matrix, and we show that it significantly improves the quality of the word embeddings. Second, we proposed a simple modification to the retrofitting method that visibly improves its performance. We showed the improvements of the proposed models over baselines in a variety of word similarity and analogy tasks, and across two popular lexical knowledge bases.
For extrinsic tasks, W-Retrofitting showed performance comparable to the state-of-the-art SynGCN model Vashishth et al. (2018), in spite of SynGCN being a more sophisticated model with many parameters (the weights of the graph convolutional layers and of the linear layers of the neural network) as well as many hyperparameters needed for training the neural network (such as the number of GCN layers and their dimensions, learning rate, number of epochs, etc.).
In our sprinkling approach, we didn’t consider any importance weighting for different relations. One promising direction for future work is to use WordNet similarity scores, or a combination of co-occurrence and lexical information, as importance values in the reachability matrix. We could also use ‘adaptive sprinkling’ Chakraborti et al. (2007) to give more importance to relations of specific sets of words.
The more recent methods that achieve state-of-the-art results in a variety of language tasks utilize pre-trained models such as ELMo Peters et al. (2018), BERT Devlin et al. (2018) and XLNet Yang et al. (2019). These models, which learn context-dependent word embeddings, are pre-trained for different language tasks and are later fine-tuned for specific tasks. Another direction of research we would like to explore is to study the improvements gained by using our proposed models to initialize the word embeddings before pre-training these models.
References
- A neural probabilistic language model. Journal of Machine Learning Research 3 (Feb), pp. 1137–1155. Cited by: §2.
- Multimodal distributional semantics. Journal of Artificial Intelligence Research 49, pp. 1–47. Cited by: §4.1.
- Sprinkling: supervised latent semantic indexing. In European Conference on Information Retrieval, pp. 510–514. Cited by: §1, §2, §3.1, §3.1.
- Supervised latent semantic indexing using adaptive sprinkling.. In IJCAI, Vol. 7, pp. 1582–1587. Cited by: §1, §2, §3.1, §3.1, §6.
- BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, Cited by: §6.
- Retrofitting word vectors to semantic lexicons. arXiv preprint arXiv:1411.4166. Cited by: §1, §2, §3.2, item 2, §4.2.
- Placing search in context: the concept revisited. ACM Transactions on information systems 20 (1), pp. 116–131. Cited by: §4.1.
- Incorporating both distributional and relational semantics in word representations. arXiv preprint arXiv:1412.4369. Cited by: §2.
- PPDB: the paraphrase database. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 758–764. Cited by: §1, §4.2.
- Simlex-999: evaluating semantic models with (genuine) similarity estimation. Computational Linguistics 41 (4), pp. 665–695. Cited by: §4.1.
- How to evaluate word embeddings? on importance of data efficiency and simple supervised tasks. arXiv preprint arXiv:1702.02170. Cited by: §4.1.
- Semantic similarity based on corpus statistics and lexical taxonomy. arXiv preprint cmp-lg/9709008. Cited by: §5.2.
- Combining local context and wordnet similarity for word sense identification. WordNet: An electronic lexical database 49 (2), pp. 265–283. Cited by: §5.2.
- Higher-order coreference resolution with coarse-to-fine inference. In NAACL-HLT, Cited by: item 2.
- Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics 3, pp. 211–225. Cited by: §4.2.
- Neural word embedding as implicit matrix factorization. In Advances in neural information processing systems, pp. 2177–2185. Cited by: §2, §3.1, §3.1.
- TR9856: a multi-word term relatedness benchmark. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), Vol. 2, pp. 419–424. Cited by: §4.1.
- An information-theoretic definition of similarity.. In Icml, Vol. 98, pp. 296–304. Cited by: §5.2.
- Better word representations with recursive neural networks for morphology. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning, pp. 104–113. Cited by: §4.1.
- The penn treebank: annotating predicate argument structure. In HLT, Cited by: item 1.
- Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781. Cited by: §1, §2, §3.1, §4.1.
- Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 746–751. Cited by: §4.1.
- WordNet: a lexical database for english. Communications of the ACM 38 (11), pp. 39–41. Cited by: §1, §4.2.
- Counter-fitting word vectors to linguistic constraints. arXiv preprint arXiv:1603.00892. Cited by: §2.
- Glove: global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 1532–1543. Cited by: §1, §2, item 1, §5.2.
- Deep contextualized word representations. arXiv preprint arXiv:1802.05365. Cited by: §6.
- A word at a time: computing word relatedness using temporal semantic analysis. In Proceedings of the 20th international conference on World wide web, pp. 337–346. Cited by: §4.1.
- Reporting score distributions makes a difference: performance study of lstm-networks for sequence tagging. In EMNLP, Cited by: item 1.
- Contextual correlates of synonymy. Communications of the ACM 8 (10), pp. 627–633. Cited by: §4.1.
- Introduction to the conll-2003 shared task: language-independent named entity recognition. ArXiv cs.CL/0306050. Cited by: item 2.
- WordNet transformed in pajek format, http://vlado.fmf.uni-lj.si/pub/networks/data./dic/wordnet/wordnet.htm, accessed on 10 April, 2019. Cited by: §4.2.
- Incorporating syntactic and semantic information in word embeddings using graph convolutional networks. In ACL, Cited by: §2, item 4, §4.4, §6.
- Verbs semantics and lexical selection. In Proceedings of the 32nd annual meeting on Association for Computational Linguistics, pp. 133–138. Cited by: §5.2.
- XLNet: generalized autoregressive pretraining for language understanding. ArXiv abs/1906.08237. Cited by: §6.
- Improving lexical embeddings with semantic knowledge. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Vol. 2, pp. 545–550. Cited by: §2.