Lexical entailment is concerned with identifying the semantic relation, if any, holding between two words, as in (pigeon, hyponym, animal). The popularity of the task stems from its potential relevance to various NLP applications, such as question answering and recognizing textual entailment Dagan et al. (2013), which often rely on lexical semantic resources with limited coverage, like WordNet Miller (1995). Relation classifiers can be used either within applications or as an intermediate step in the construction of lexical resources, which is often expensive and time-consuming.
Most methods for lexical entailment are distributional, i.e., the semantic relation holding between x and y is recognized based on their distributional vector representations. While the first methods were unsupervised and used high-dimensional sparse vectors Weeds and Weir (2003); Kotlerman et al. (2010); Santus et al. (2014), in recent years, supervised methods have become popular Baroni et al. (2012); Roller et al. (2014); Weeds et al. (2014). These methods are mostly based on word embeddings Mikolov et al. (2013b); Pennington et al. (2014a), utilizing various vector combinations that are designed to capture relational information between two words.
While most previous work reported success using supervised methods, some questions remain unanswered. First, several works suggested that supervised distributional methods are incapable of inferring the relationship between two words, and instead rely on independent properties of each word Levy et al. (2015); Roller and Erk (2016); Shwartz et al. (2016), making them sensitive to training data. Second, it remains unclear which representation and classifier are most appropriate; previous studies reported inconsistent results with Concat Baroni et al. (2012) and Diff Roller et al. (2014); Weeds et al. (2014); Fu et al. (2014), using various classifiers.
In this paper, we investigate the effectiveness of multiplicative features, namely, the element-wise multiplication Mult and the squared difference Sqdiff. These features, similar to the cosine similarity and the Euclidean distance, might capture a different notion of interaction information about the relationship holding between two words. We directly integrate them into some commonly used representations. For instance, we consider the concatenation Concat ⊕ Mult, which might capture both the typicality of each word in the relation (e.g., if y is a typical hypernym) and the similarity between the words.
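To make the analogy with cosine similarity and Euclidean distance concrete, the identities below relate each feature vector to the corresponding scalar measure. The notation is an assumption introduced for illustration: $\vec{v}_x, \vec{v}_y \in \mathbb{R}^d$ denote the embeddings of the two words and $\odot$ denotes the element-wise product.

```latex
\begin{align*}
\mathrm{Mult}(x, y) &= \vec{v}_x \odot \vec{v}_y,
  & \sum_{i=1}^{d} \mathrm{Mult}(x, y)_i
  &= \vec{v}_x \cdot \vec{v}_y
   = \lVert \vec{v}_x \rVert \, \lVert \vec{v}_y \rVert \cos(\vec{v}_x, \vec{v}_y), \\
\mathrm{Sqdiff}(x, y) &= (\vec{v}_y - \vec{v}_x) \odot (\vec{v}_y - \vec{v}_x),
  & \sum_{i=1}^{d} \mathrm{Sqdiff}(x, y)_i
  &= \lVert \vec{v}_y - \vec{v}_x \rVert^{2}.
\end{align*}
```

A classifier that assigns one weight per dimension to these features can therefore learn a weighted variant of the dot product or of the squared Euclidean distance.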
We experiment with multiple supervised distributional methods and analyze which representations perform well in various evaluation setups. Our analysis confirms that integrating multiplicative features into standard representations can substantially boost the performance of linear classifiers. While the contribution over non-linear classifiers is sometimes marginal, non-linear classifiers are expensive to train, and linear classifiers can achieve the same effect “cheaply” by integrating multiplicative features. The contribution of multiplicative features is most prominent in strict evaluation settings, i.e., lexical split Levy et al. (2015) and out-of-domain evaluation, which disable the models’ ability to achieve good performance by memorizing words seen during training. We find that Concat ⊕ Mult performs consistently well, and suggest it as a strong baseline for future research.
2 Related Work
Recent work questioned whether supervised distributional methods actually learn the relation between x and y or only separate properties of each word. Levy et al. (2015) claimed that they tend to perform “lexical memorization”, i.e., memorizing that some words are prototypical to certain relations (e.g., that animal is a hypernym, regardless of x). Roller and Erk (2016) found that under certain conditions, these methods actively learn to infer hypernyms based on separate occurrences of x and y in Hearst patterns Hearst (1992). In either case, they only learn whether x and y independently match their corresponding slots in the relation, a limitation which makes them sensitive to the training data Shwartz et al. (2017); Sanchez and Riedel (2017).
Levy et al. (2015) claimed that the linear nature of most supervised methods limits their ability to capture the relation between words. They suggested that using a support vector machine (SVM) with non-linear kernels slightly mitigates this issue, and proposed KSIM, a custom kernel with multiplicative integration.
The element-wise multiplication has been studied by Weeds et al. (2014), but models that operate exclusively on it were not competitive with Concat and Diff on most tasks. Roller et al. (2014) found that the squared difference, in combination with Diff, is useful for hypernymy detection. Nevertheless, little to no work has focused on investigating combinations of representations obtained by concatenating various base representations for the more general task of lexical entailment.
We classify each word pair (x, y) to a specific semantic relation that holds for them, from a set of pre-defined relations (i.e., multiclass classification), based on their distributional representations.
3.1 Word Pair Representations
Given a word pair (x, y) and their embeddings (v_x, v_y), we consider various compositions as feature vectors for classifiers. Table 1 displays base representations and combination representations, achieved by concatenating two base representations.
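The compositions named in the text can be sketched as small functions over two embedding vectors. This is a minimal illustration that uses plain Python lists to stay dependency-free. The function names, the `represent` helper, and the `"concat+mult"` spec syntax are hypothetical, and the sign convention for Diff (second word minus first) is an assumption.

```python
from typing import Callable, Dict, List

Vector = List[float]


def concat(vx: Vector, vy: Vector) -> Vector:
    """Concat: the two embeddings placed side by side."""
    return list(vx) + list(vy)


def diff(vx: Vector, vy: Vector) -> Vector:
    """Diff: element-wise difference (assumed direction: vy - vx)."""
    return [b - a for a, b in zip(vx, vy)]


def only_y(vx: Vector, vy: Vector) -> Vector:
    """Only-y: ignores the first word entirely."""
    return list(vy)


def mult(vx: Vector, vy: Vector) -> Vector:
    """Mult: element-wise product."""
    return [a * b for a, b in zip(vx, vy)]


def sqdiff(vx: Vector, vy: Vector) -> Vector:
    """Sqdiff: element-wise squared difference."""
    return [(b - a) ** 2 for a, b in zip(vx, vy)]


BASE: Dict[str, Callable[[Vector, Vector], Vector]] = {
    "concat": concat,
    "diff": diff,
    "only_y": only_y,
    "mult": mult,
    "sqdiff": sqdiff,
}


def represent(spec: str, vx: Vector, vy: Vector) -> Vector:
    """Build a base or combined representation, e.g. 'concat+mult'."""
    features: Vector = []
    for name in spec.split("+"):
        features.extend(BASE[name](vx, vy))
    return features


# represent("concat+mult", [1.0, 2.0], [3.0, 5.0])
# -> [1.0, 2.0, 3.0, 5.0, 3.0, 10.0]
```

A combination is just the concatenation of two base feature vectors, so its dimensionality is the sum of theirs.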
3.2 Word Vectors
We used 300-dimensional pre-trained word embeddings, namely, GloVe Pennington et al. (2014b), containing 1.9M word vectors trained on a corpus of web data from Common Crawl (42B tokens; http://nlp.stanford.edu/projects/glove/), and Word2vec Mikolov et al. (2013a, c), containing 3M word vectors trained on part of the Google News dataset (100B tokens; http://code.google.com/p/word2vec/). Out-of-vocabulary words were initialized randomly.
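A minimal sketch of embedding lookup with random initialization for out-of-vocabulary words follows. The text does not say which distribution was used, so the uniform range (`scale`) is an assumption, and the `lookup` helper is hypothetical. The random vector is cached so that a given unseen word always maps to the same vector.

```python
import random
from typing import Dict, List


def lookup(word: str,
           embeddings: Dict[str, List[float]],
           dim: int,
           rng: random.Random,
           scale: float = 0.1) -> List[float]:
    """Return the vector for `word`, creating a random one if it is missing.

    The uniform(-scale, scale) initialization is an assumption made for
    this sketch; any small random initialization would fit the description.
    """
    if word not in embeddings:
        embeddings[word] = [rng.uniform(-scale, scale) for _ in range(dim)]
    return embeddings[word]


rng = random.Random(0)
toy_embeddings = {"cat": [0.5, -0.5, 0.25]}   # toy 3-dimensional table
known = lookup("cat", toy_embeddings, 3, rng)
unseen = lookup("zyzzyva", toy_embeddings, 3, rng)
```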
3.3 Classifiers
We trained several classifiers, namely, logistic regression with regularization (LR), SVM with a linear kernel (LIN), and SVM with a Gaussian kernel (RBF). In addition, we trained multi-layer perceptrons with a single hidden layer (MLP). We compare our models against the KSIM model found to be successful in previous work Levy et al. (2015); Kruszewski et al. (2015). We do not include the model of Roller and Erk (2016) since it focuses only on hypernymy. Hyper-parameters are tuned using grid search, and we report the test performance of the hyper-parameters that performed best on the validation set. Below are more details about the training procedure:
For LR, the inverse of regularization strength is selected from .
For LIN, the penalty parameter of the error term is selected from .
For RBF, and values are selected from and , respectively.
For MLP, the hidden layer size is either 50 or 100, and the learning rate is fixed at
. We use early stopping based on the performance on the validation set. The maximum number of training epochs is 100.
For KSIM, and values are selected from and , respectively.
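The tuning procedure described above (try every grid point, keep the setting that scores best on the validation set) can be sketched generically. The `grid_search` helper and its callback interface are hypothetical, and the grid values in the usage example are placeholders, not the values used in the experiments.

```python
import itertools
from typing import Any, Callable, Dict, List, Tuple


def grid_search(param_grid: Dict[str, List[Any]],
                train_fn: Callable[..., Any],
                score_fn: Callable[[Any], float]) -> Tuple[Dict[str, Any], float, Any]:
    """Exhaustively try every combination in `param_grid`.

    train_fn(**params) fits a model on the training set and returns it;
    score_fn(model) evaluates that model on the validation set.
    Returns the best parameters, their validation score, and the model.
    """
    names = sorted(param_grid)
    best_params, best_score, best_model = None, float("-inf"), None
    for values in itertools.product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        model = train_fn(**params)
        score = score_fn(model)
        if score > best_score:
            best_params, best_score, best_model = params, score, model
    return best_params, best_score, best_model


# Toy usage: the "model" is just its parameters, and the validation score
# peaks at C=1.0, gamma=0.1. All values here are placeholders.
toy_grid = {"C": [0.1, 1.0, 10.0], "gamma": [0.01, 0.1]}
best, val_score, _ = grid_search(
    toy_grid,
    train_fn=lambda **p: p,
    score_fn=lambda m: -abs(m["C"] - 1.0) - abs(m["gamma"] - 0.1),
)
```

Only the selected model is then evaluated on the test set, so the test data plays no part in choosing hyper-parameters.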
3.4 Datasets
We evaluated the methods on four common semantic relation datasets: BLESS Baroni and Lenci (2011), K&H+N Necsulescu et al. (2015), ROOT09 Santus et al. (2016), and EVALution Santus et al. (2015). Table 2 provides metadata on the datasets. Most datasets contain word pairs instantiating different, explicitly typed semantic relations, plus a number of unrelated word pairs (random). Instances in BLESS and K&H+N are divided into a number of topical domains. (We discarded two relations in EVALution with too few instances and did not include its domain information, since each word pair can belong to multiple domains at once.)
3.5 Evaluation Setup
We consider the following evaluation setups:
Random (RAND)
We randomly split each dataset into 70% train, 5% validation, and 25% test.
Lexical Split (LEX)
In line with recent work Shwartz et al. (2016), we split each dataset into train, validation, and test sets so that each contains a distinct vocabulary. This differs from Levy et al. (2015), who dedicated a subset of the train set for evaluation, allowing the model to memorize when tuning hyper-parameters. We tried to keep the same ratio as in the random setup.
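A lexical split can be sketched by partitioning the vocabulary first and then assigning each pair to the partition that contains both of its words. The `lexical_split` helper below is hypothetical. Dropping pairs whose words fall into different partitions is an assumption of this sketch, and it means the resulting pair ratios only approximate the vocabulary ratios.

```python
import random
from typing import Dict, List, Tuple

Pair = Tuple[str, str, str]  # (x, y, relation)


def lexical_split(pairs: List[Pair],
                  ratios: Tuple[float, float, float] = (0.70, 0.05, 0.25),
                  seed: int = 0) -> Dict[str, List[Pair]]:
    """Split pairs so that train, val and test share no words."""
    vocab = sorted({w for x, y, _ in pairs for w in (x, y)})
    random.Random(seed).shuffle(vocab)

    n_train = round(ratios[0] * len(vocab))
    n_val = round(ratios[1] * len(vocab))
    part_of = {}
    for i, word in enumerate(vocab):
        if i < n_train:
            part_of[word] = "train"
        elif i < n_train + n_val:
            part_of[word] = "val"
        else:
            part_of[word] = "test"

    splits: Dict[str, List[Pair]] = {"train": [], "val": [], "test": []}
    for x, y, rel in pairs:
        # Assumption: pairs that straddle two partitions are discarded.
        if part_of[x] == part_of[y]:
            splits[part_of[x]].append((x, y, rel))
    return splits
```

Because no word appears in more than one split, a classifier cannot score well on the test set merely by remembering which words were prototypical during training.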
Out-of-domain (OOD)
To test whether the methods capture a generic notion of each semantic relation, we test them on a domain that the classifiers have not seen during training. This setup is more realistic than the random and lexical split setups, in which the classifiers can benefit from memorizing verbatim words (random) or regions in the vector space (lexical split) that fit a specific slot of each relation.
Specifically, on BLESS and K&H+N, one domain is held out for testing whilst the classifiers are trained and validated on the remaining domains. This process is repeated using each domain as the test set, and each time, a randomly selected domain among the remaining domains is left out for validation. The average results are reported.
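The hold-one-domain-out procedure can be sketched as a generator of folds. The `out_of_domain_folds` and `average` helpers and the dictionary layout are hypothetical. The sketch assumes at least three domains, so that the training set is non-empty after one domain is held out for testing and another for validation.

```python
import random
from typing import Dict, Iterator, List, Tuple

Pair = Tuple[str, str, str]  # (x, y, relation)


def out_of_domain_folds(pairs_by_domain: Dict[str, List[Pair]],
                        seed: int = 0) -> Iterator[Dict[str, object]]:
    """Yield one fold per domain: that domain is the test set, a randomly
    chosen other domain is the validation set, and the rest is training."""
    rng = random.Random(seed)
    domains = sorted(pairs_by_domain)
    for test_domain in domains:
        remaining = [d for d in domains if d != test_domain]
        val_domain = rng.choice(remaining)
        train = [pair
                 for d in remaining if d != val_domain
                 for pair in pairs_by_domain[d]]
        yield {
            "test_domain": test_domain,
            "val_domain": val_domain,
            "train": train,
            "val": list(pairs_by_domain[val_domain]),
            "test": list(pairs_by_domain[test_domain]),
        }


def average(scores: List[float]) -> float:
    """Mean of the per-fold test scores, as reported for this setup."""
    return sum(scores) / len(scores)
```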
Table 3 summarizes the best performing base representations and combinations on the test sets across the various datasets and evaluation setups. (Due to space limitations, we only show the results obtained with GloVe; the trend is similar across the word embeddings.) The results across the datasets vary substantially in some cases due to the differences between the datasets’ relations, class balance, and the source from which they were created. For instance, K&H+N is imbalanced between the number of instances across relations and domains. ROOT09 was designed to mitigate the lexical memorization issue by adding negative switched hyponym-hypernym pairs to the dataset, making it an inherently more difficult dataset. EVALution contains a richer set of semantic relations. Overall, the addition of multiplicative features improves upon the performance of the base representations.
Multiplicative features substantially boost the performance of linear classifiers. However, the gain from adding multiplicative features is smaller when non-linear classifiers are used, since they partially capture this notion of interaction Levy et al. (2015). Within the same representation, there is a clear preference for non-linear classifiers over linear classifiers.
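One way to see why linear classifiers benefit most is to write out the score a linear model assigns to a pair. The sketch below uses notation assumed for illustration (embeddings $\vec{v}_x, \vec{v}_y \in \mathbb{R}^d$, weight vectors $\vec{w}_x, \vec{w}_y, \vec{u}$, and bias $b$); it is a reading of the argument, not a derivation taken from the experiments.

```latex
\begin{align*}
s_{\mathrm{Concat}}(x, y)
  &= \vec{w}_x \cdot \vec{v}_x + \vec{w}_y \cdot \vec{v}_y + b, \\
s_{\mathrm{Concat} \oplus \mathrm{Mult}}(x, y)
  &= \vec{w}_x \cdot \vec{v}_x + \vec{w}_y \cdot \vec{v}_y
     + \sum_{i=1}^{d} u_i \, v_{x,i} \, v_{y,i} + b.
\end{align*}
```

The first score is the sum of a term that depends only on x and a term that depends only on y, so it cannot express how the two words interact. The added sum is a weighted dot product that depends on both words jointly, which is the kind of interaction a non-linear kernel can capture on its own.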
The Only-y representation indicates how well a model can perform without considering the relation between x and y Levy et al. (2015). Indeed, in RAND, this method performs similarly to the others, except on ROOT09, which by design disables lexical memorization. As expected, a general decrease in performance is observed in LEX and OOD, stemming from the methods’ inability to benefit from lexical memorization. In these setups, there is a more significant gain from using multiplicative features when non-linear classifiers are used.
Word Pair Representations
Among the base representations, Concat often performed best, while Mult seemed to be the preferred multiplicative addition. Concat ⊕ Mult performed consistently well, intuitively because Concat captures the typicality of each word in the relation (e.g., if y is a typical hypernym) and Mult captures the similarity between the words (where Concat alone may suggest that animal is a hypernym of apple). To take a closer look at the gain from adding Mult, Table 4 shows the performance of the various base representations and combinations with Mult using different classifiers on BLESS. (We also tried with multiplicative features but they performed worse.)
5 Analysis of Multiplicative Features
We focus the rest of the discussion on the OOD setup, as we believe it is the most challenging setup, forcing methods to consider the relation between x and y. We found that in this setup, all methods performed poorly on K&H+N, likely due to its imbalanced domain and relation distribution. Examining the per-relation scores, we see that many methods classify all pairs to one relation. Even KSIM, the best performing method in this setup, classifies pairs as either hyper or random, effectively only determining if they are related or not. We therefore focus our analysis on BLESS.
To get a better intuition of the contribution of multiplicative features, Table 5 exemplifies pairs that were incorrectly classified by Concat (RBF) while correctly classified by Concat ⊕ Mult (RBF), along with their cosine similarity scores. It seems that Mult indeed captures the similarity between x and y. While Concat sometimes relies on properties of a single word, e.g., classifying an adjective to the attribute relation and a verb to the event relation, adding Mult changes the classification of such pairs with low similarity scores to random. Conversely, pairs with high similarity scores which were falsely classified as random by Concat are assigned specific relations by Concat ⊕ Mult.
Interestingly, we found that across domains, there is an almost consistent order of relations with respect to mean intra-pair cosine similarity:
Since the difference between random (0.141) and the other relations (0.279–0.426) was the largest, it seems that multiplicative features help distinguish between related and unrelated pairs. This similarity is possibly also used to distinguish between other relations.
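The per-relation statistic discussed here, the mean cosine similarity between the two words of each pair, can be computed along the following lines. The helper names and the toy vectors are hypothetical and serve only to show the computation; they do not reproduce the reported values.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple

Pair = Tuple[str, str, str]  # (x, y, relation)


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity; defined as 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_cosine_by_relation(pairs: List[Pair],
                            embeddings: Dict[str, List[float]]) -> Dict[str, float]:
    """Average intra-pair cosine similarity for each relation label."""
    sims = defaultdict(list)
    for x, y, relation in pairs:
        sims[relation].append(cosine(embeddings[x], embeddings[y]))
    return {rel: sum(vals) / len(vals) for rel, vals in sims.items()}


# Toy 2-dimensional vectors, for illustration only.
toy_vectors = {"a": [1.0, 0.0], "b": [1.0, 0.0], "c": [0.0, 1.0]}
toy_pairs = [("a", "b", "hyper"), ("a", "c", "random")]
means = mean_cosine_by_relation(toy_pairs, toy_vectors)
```

Sorting the resulting dictionary by value gives the ordering of relations by mean similarity.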
6 Conclusion
We have suggested a cheap way to boost the performance of supervised distributional methods for lexical entailment by integrating multiplicative features into standard word-pair representations. Our results confirm that the multiplicative features boost the performance of linear classifiers, and in strict evaluation setups, also of non-linear classifiers. We performed an extensive evaluation with different classifiers and evaluation setups, and suggest the out-of-domain evaluation as the most suitable for the task. Directions for future work include investigating other compositions, and designing a neural model that can automatically learn such features.
We would like to thank Wei Lu for his involvement and advice in the early stage of this project, Stephen Roller and Omer Levy for valuable discussions, and the anonymous reviewers for their insightful comments and suggestions.
Vered is supported in part by an Intel ICRI-CI grant, the Israel Science Foundation grant 1951/17, the German Research Foundation through the German-Israeli Project Cooperation (DIP, grant DA 1600/1-1), the Clore Scholars Programme (2017), and the AI2 Key Scientific Challenges Program (2017).
- Baroni et al. (2012) Marco Baroni, Raffaella Bernardi, Ngoc-Quynh Do, and Chung-chieh Shan. 2012. Entailment above the word level in distributional semantics. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 23–32, Avignon, France. Association for Computational Linguistics.
- Baroni and Lenci (2011) Marco Baroni and Alessandro Lenci. 2011. How we blessed distributional semantic evaluation. In Proceedings of the GEMS 2011 Workshop on GEometrical Models of Natural Language Semantics, pages 1–10. Association for Computational Linguistics.
- Dagan et al. (2013) Ido Dagan, Dan Roth, Mark Sammons, and Fabio Massimo Zanzotto. 2013. Recognizing textual entailment: Models and applications. Synthesis Lectures on Human Language Technologies, 6(4):1–220.
- Fu et al. (2014) Ruiji Fu, Jiang Guo, Bing Qin, Wanxiang Che, Haifeng Wang, and Ting Liu. 2014. Learning semantic hierarchies via word embeddings. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1199–1209, Baltimore, Maryland. Association for Computational Linguistics.
- Hearst (1992) Marti A. Hearst. 1992. Automatic acquisition of hyponyms from large text corpora. In Proceedings of the 14th Conference on Computational Linguistics - Volume 2, COLING ’92, pages 539–545.
- Kotlerman et al. (2010) Lili Kotlerman, Ido Dagan, Idan Szpektor, and Maayan Zhitomirsky-geffet. 2010. Directional distributional similarity for lexical inference. Natural Language Engineering, 16(4):359–389.
- Kruszewski et al. (2015) Germán Kruszewski, Denis Paperno, and Marco Baroni. 2015. Deriving boolean structures from distributional vectors. Transactions of the Association for Computational Linguistics, 3:375–388.
- Levy et al. (2015) Omer Levy, Steffen Remus, Chris Biemann, and Ido Dagan. 2015. Do supervised distributional methods really learn lexical inference relations? In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 970–976, Denver, Colorado. Association for Computational Linguistics.
- Mikolov et al. (2013a) Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. In Proceedings of Workshop at ICLR.
- Mikolov et al. (2013b) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119.
- Mikolov et al. (2013c) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013c. Distributed representations of words and phrases and their compositionality. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 3111–3119. Curran Associates, Inc.
- Miller (1995) George A. Miller. 1995. Wordnet: A lexical database for english. Communications of the ACM, 38(11):39–41.
- Necsulescu et al. (2015) Silvia Necsulescu, Sara Mendes, David Jurgens, Núria Bel, and Roberto Navigli. 2015. Reading between the lines: Overcoming data sparsity for accurate classification of lexical relationships. In Proceedings of the Fourth Joint Conference on Lexical and Computational Semantics, pages 182–192, Denver, Colorado. Association for Computational Linguistics.
- Pennington et al. (2014a) Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014a. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543. Association for Computational Linguistics.
- Pennington et al. (2014b) Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014b. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar. Association for Computational Linguistics.
- Roller and Erk (2016) Stephen Roller and Katrin Erk. 2016. Relations such as hypernymy: Identifying and exploiting hearst patterns in distributional vectors for lexical entailment. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2163–2172, Austin, Texas. Association for Computational Linguistics.
- Roller et al. (2014) Stephen Roller, Katrin Erk, and Gemma Boleda. 2014. Inclusive yet selective: Supervised distributional hypernymy detection. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pages 1025–1036, Dublin, Ireland. Dublin City University and Association for Computational Linguistics.
- Sanchez and Riedel (2017) Ivan Sanchez and Sebastian Riedel. 2017. How well can we predict hypernyms from word embeddings? a dataset-centric analysis. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 401–407, Valencia, Spain. Association for Computational Linguistics.
- Santus et al. (2016) Enrico Santus, Alessandro Lenci, Tin-Shing Chiu, Qin Lu, and Chu-Ren Huang. 2016. Nine features in a random forest to learn taxonomical semantic relations. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Paris, France.
- Santus et al. (2014) Enrico Santus, Alessandro Lenci, Qin Lu, and Sabine Schulte im Walde. 2014. Chasing hypernyms in vector spaces with entropy. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, volume 2: Short Papers, pages 38–42, Gothenburg, Sweden. Association for Computational Linguistics.
- Santus et al. (2015) Enrico Santus, Frances Yung, Alessandro Lenci, and Chu-Ren Huang. 2015. Evalution 1.0: an evolving semantic dataset for training and evaluation of distributional semantic models. In Proceedings of the 4th Workshop on Linked Data in Linguistics (LDL-2015), pages 64–69.
- Shwartz et al. (2016) Vered Shwartz, Yoav Goldberg, and Ido Dagan. 2016. Improving hypernymy detection with an integrated path-based and distributional method. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2389–2398, Berlin, Germany. Association for Computational Linguistics.
- Shwartz et al. (2017) Vered Shwartz, Enrico Santus, and Dominik Schlechtweg. 2017. Hypernyms under siege: Linguistically-motivated artillery for hypernymy detection. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 65–75, Valencia, Spain. Association for Computational Linguistics.
- Weeds et al. (2014) Julie Weeds, Daoud Clarke, Jeremy Reffin, David Weir, and Bill Keller. 2014. Learning to distinguish hypernyms and co-hyponyms. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pages 2249–2259, Dublin, Ireland. Dublin City University and Association for Computational Linguistics.
- Weeds and Weir (2003) Julie Weeds and David Weir. 2003. A general framework for distributional similarity. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing.