Preserving the Hypernym Tree of WordNet in Dense Embeddings

04/22/2020 ∙ by Canlin Zhang, et al. ∙ Florida State University 0

In this paper, we provide a novel way to generate low-dimensional (dense) vector embeddings for the noun and verb synsets in WordNet, so that the hypernym-hyponym tree structure is preserved in the embeddings. We call this embedding the sense spectrum (plural: sense spectra). In order to create suitable labels for the training of sense spectra, we design a new similarity measurement for noun and verb synsets in WordNet. We call this similarity measurement the hypernym intersection similarity (HIS), since it compares the common and unique hypernyms between two synsets. Our experiments show that on the noun and verb pairs of the SimLex-999 dataset, HIS outperforms the three basic similarity measurements in WordNet. Moreover, to the best of our knowledge, sense spectra are the first dense embedding system that can explicitly and completely measure the hypernym-hyponym relationship in WordNet.


1 Introduction

WordNet is a lexical database for the English language Miller et al. (1990), which groups English words into sets of synonyms called synsets Miller (1995). Each synset is related to a specific semantic sense, and synsets related to the same semantic sense are usually ordered by their usage frequencies in English. There are four types of synsets in WordNet: noun (n), verb (v), adjective (a) and adverb (r). As a result, a synset in WordNet is represented in the form of “semantic sense.type.ordering”. For instance, domestic_animal.n.01 means the first noun synset related to the semantic sense “domestic animal”, and eat.v.03 means the third verb synset related to the semantic sense “eat”.
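To make the “semantic sense.type.ordering” naming form concrete, the following is a minimal sketch of how such a name could be split into its three parts. The class and function names (`SynsetName`, `parse_synset_name`) are illustrative assumptions and are not part of WordNet or of our released code.

```python
from typing import NamedTuple


class SynsetName(NamedTuple):
    sense: str   # semantic sense, e.g. "domestic_animal"
    pos: str     # synset type: n, v, a or r
    order: int   # ordering among the synsets sharing this sense


def parse_synset_name(name: str) -> SynsetName:
    """Split a name of the form 'semantic sense.type.ordering'."""
    # Split from the right so that any dot inside the sense part is kept.
    sense, pos, order = name.rsplit(".", 2)
    if pos not in {"n", "v", "a", "r"}:
        raise ValueError(f"unknown synset type: {pos!r}")
    return SynsetName(sense, pos, int(order))


if __name__ == "__main__":
    print(parse_synset_name("eat.v.03"))
```

A malformed name with fewer than three parts raises a `ValueError` at the unpacking step, which is usually the desired behavior for a sketch of this size.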

WordNet can be regarded as a dictionary, since it provides brief definitions and usage examples for each synset. On the other hand, WordNet can also be regarded as a thesaurus Boyd-Graber et al. (2006), since it records a number of semantic relationships among synsets or their members (called lemmas). The most important relationship among synsets in WordNet is the hypernym-hyponym relationship Yamada et al. (2009), which links a generic term (hypernym) to a specific instance of it (hyponym). This relationship forms an intricate tree-like structure Miller et al. (1990) whose nodes are the synsets. Only noun and verb synsets in WordNet possess the hypernym-hyponym relationship Miller and Hristea (2006).

Since almost all state-of-the-art Natural Language Processing (NLP) models are built on embeddings Mikolov et al. (2013); Devlin et al. (2018), it is desirable to represent the synsets in WordNet by embeddings as well. To be specific, low dimensional embeddings (dense embeddings) that can preserve the semantic relationships among synsets in WordNet are especially desired, but this has not been fully realized. Also, the hypernym-hyponym relationship is regarded as the most important relationship in WordNet Miller (1995). So, it will be valuable to generate dense embeddings that can completely preserve the hypernym-hyponym tree structure for noun and verb synsets in WordNet.

Hence, we first design the hypernym intersection similarity (HIS) as the desired measurement, by which the “commonness” and “differences” between two noun or verb synsets are measured according to how their hypernym sets intersect. Using HIS as labels, we train the synset embeddings with a novel operation other than the inner product, which preserves the HIS measurement (and hence the hypernym-hyponym tree) in the synset embeddings. This training method makes our embedding vector look like a “spectrum of senses”. So, we call it the sense spectrum. After training, the same operation is used to measure the hypernym-hyponym tree structure preserved by sense spectra.

In the next section, we discuss the related work on creating embeddings for WordNet synsets. Then in Section 3, we introduce the architectures of our model. In Section 4, we describe our implementations and provide experimental results. In Section 5, we provide further discussions on our model. Finally, we conclude the paper with a brief summary in Section 6.

2 Related Work

Roughly speaking, there are two traditional ways to create embeddings for WordNet synsets. One way is to combine the embeddings of the words appearing in the definition or usage examples of a synset, which requires pre-trained word embeddings from other models Rothe and Schutze (2015). Synset embeddings created in this way are dense, yet preserve no semantic relationships. Another way is to keep each synset in one unique dimension of the embedding vector, and then create a binary matrix recording the existence (or not) of one specific semantic relationship between any two synsets (two dimensions). Synset embeddings created in this way do preserve the semantic relationships. But these embeddings are high dimensional several-hot vectors Bengio et al. (2003), which often lead to over-fitting when used as neural network inputs Rojas (2015).

Therefore, many novel methods are designed to generate dense embeddings that can preserve the semantic relationships in WordNet. Among them, wnet2vec Saedi et al. (2018) provides the model that is most similar to ours. Put simply, wnet2vec performs Principal Component Analysis (PCA) Esbensen and Geladi (1987) on the binary matrix recording the semantic relationships to obtain compressed synset vectors. But in this way, the semantic relationships are only implicitly preserved in the compressed vectors and cannot be measured directly. Hence, we hope to design a synset embedding system together with a measurement, so that the semantic relationships are not only preserved in the dense embeddings but also explicitly measured by the measurement. This goal motivates our research on the sense spectrum.

3 Architectures

In the first subsection, we introduce the proposed HIS measurement. Then, in the next subsection, we introduce the three basic similarity measurements in WordNet, which are used as comparisons to our HIS measurement. After that, the formulas and training algorithms of sense spectra are given. Besides, we note that it is not very meaningful to compare a noun synset with a verb one. So, whenever we mention “two (noun or verb) synsets $A$ and $B$” in this paper, we assume that either both $A$ and $B$ are noun synsets, or both of them are verb ones.

3.1 Hypernym intersection similarity

First, we note that WordNet not only provides the direct hypernym for each noun and verb synset, but also provides its hypernym closure Miller et al. (1990): Suppose $B$ is a direct hypernym of the synset $A$, and $C$ is a direct hypernym of $B$. Then, the hypernym closure of $A$ will contain both $B$ and $C$. That is, the hypernym closure consists of “all the hypernyms of all the hypernyms” for the synset $A$, which is denoted as $\mathrm{HC}(A)$ in this paper.

For example, if we set synset $A$ to be man.n.01 and synset $B$ to be woman.n.01, their hypernym closures are then shown in Figure 1:

Figure 1: (To be viewed in color) The hypernym closures of synsets man.n.01 and woman.n.01.

The synset man.n.01 denotes the common sense of “man”, whose definition in WordNet is “an adult person who is male (as opposed to a woman)”. Accordingly, woman.n.01 is defined as “an adult female person (as opposed to a man)”. They have the same direct hypernym adult.n.01. We can see from Figure 1 that all the hypernyms of man.n.01 and woman.n.01 are the same, except that male.n.02 is unique to man.n.01 and female.n.02 is unique to woman.n.01. Hence, the hypernym closures are the key to describing the hypernym-hyponym relationship between two synsets $A$ and $B$.

However, we will not use the hypernym closure directly in the HIS measurement. This is because the semantic field of a synset should be smaller than that of its hypernym closure Gao and Xu (2013). Hence, based on the hypernym-hyponym relationship, the precise representation of a synset should be its hypernym closure plus the synset itself, which is the hypernym set. We build the HIS measurement based on the hypernym set.

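Then, for two synsets $A$ and $B$ with hypernym sets $\mathrm{HS}(A)$ and $\mathrm{HS}(B)$, we define the “commonness” between them as $\mathrm{HS}(A) \cap \mathrm{HS}(B)$, which can also be denoted as $C_{AB}$. The “uniqueness” of synset $A$ is defined as $\mathrm{HS}(A) \setminus \mathrm{HS}(B)$, which consists of the hypernyms unique to the synset $A$ (the ones not in $\mathrm{HS}(B)$). We denote it as $U_A$. Similarly, the “uniqueness” of synset $B$ is defined as $U_B = \mathrm{HS}(B) \setminus \mathrm{HS}(A)$. We call $C_{AB}$, $U_A$ and $U_B$ the Hypernym Representation Sets.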

Taking and as our example again, we can see from Figure 1 that

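Then, we set $c = |C_{AB}|$, $u_A = |U_A|$ and $u_B = |U_B|$, which denote the sizes of the commonness set $C_{AB}$ and of the uniqueness sets $U_A$ and $U_B$. Again, if $A$ is man.n.01 and $B$ is woman.n.01, each uniqueness set contains the synset itself and its single unique hypernym, so $u_A = u_B = 2$.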
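The construction of the hypernym closure, the hypernym set and the three hypernym representation sets can be sketched as follows. This is a minimal illustration on a toy hierarchy: the dictionary `TOY_HYPERNYMS` and all function names are assumptions made for this example, and the toy graph is much smaller than the real WordNet hierarchy.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy hierarchy (synset -> direct hypernyms); not real WordNet data.
TOY_HYPERNYMS: Dict[str, List[str]] = {
    "man": ["adult", "male"],
    "woman": ["adult", "female"],
    "adult": ["person"],
    "male": ["person"],
    "female": ["person"],
    "person": ["entity"],
    "entity": [],
}


def hypernym_closure(synset: str, hypernyms: Dict[str, List[str]]) -> Set[str]:
    """All hypernyms of all hypernyms of `synset` (the synset itself excluded)."""
    closure: Set[str] = set()
    stack = list(hypernyms[synset])
    while stack:
        node = stack.pop()
        if node not in closure:
            closure.add(node)
            stack.extend(hypernyms[node])
    return closure


def hypernym_set(synset: str, hypernyms: Dict[str, List[str]]) -> Set[str]:
    """Hypernym closure plus the synset itself."""
    return hypernym_closure(synset, hypernyms) | {synset}


def representation_sets(
    a: str, b: str, hypernyms: Dict[str, List[str]]
) -> Tuple[Set[str], Set[str], Set[str]]:
    """Return (commonness set, uniqueness of a, uniqueness of b)."""
    hs_a, hs_b = hypernym_set(a, hypernyms), hypernym_set(b, hypernyms)
    return hs_a & hs_b, hs_a - hs_b, hs_b - hs_a


if __name__ == "__main__":
    common, only_a, only_b = representation_sets("man", "woman", TOY_HYPERNYMS)
    # The three HIS scalars are simply the sizes of these sets.
    print(len(common), len(only_a), len(only_b))
```

On this toy graph the commonness set is {adult, person, entity}, and each uniqueness set holds the synset itself plus its one unique hypernym.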

Finally, for any two noun or verb synsets $A$ and $B$, the hypernym intersection similarity (HIS) is defined as:

(1)

Here, the exponent parameters and the scalar parameter are tuned empirically. We find that “smoothing” the HIS scalars by applying exponent parameters less than one leads to better performance. We also find that reducing the importance of the two “uniqueness” scalars by applying a scalar parameter less than one leads to better performance. These empirical observations are the rationale behind the parameters.

The initial scalars of the HIS measurement will be used as labels in the training of sense spectra, which is introduced in Subsection 3.3.

3.2 Three basic synset similarities

There are three basic measurements of the similarity between two noun or verb synsets in WordNet: the Shortest Path Similarity, the Leacock-Chodorow Similarity and the Wu-Palmer Similarity Slimani (2013). They are “basic” since they only require the hypernym-hyponym relationship between two synsets Jones (1979), which is the same as the HIS measurement. Hence, they serve as the baselines for our HIS measurement.

Shortest Path Similarity: All the noun synsets share the same root hypernym entity.n.01. But there may be no common hypernym between two verb synsets. So, a fake root synset is added to the verb synsets.

Then, for any two noun or verb synsets $A$ and $B$, there is always a hypernym-hyponym path connecting them through one of their common hypernyms. Among all these paths there is a shortest one, whose length is denoted as $d(A, B)$. The Shortest Path Similarity between synsets $A$ and $B$ is then defined to be $\mathrm{sim}_{path}(A, B) = 1 / (1 + d(A, B))$, which is between 0 and 1.
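As an illustration of the shortest path through a common hypernym, the sketch below computes the path length and the resulting similarity on a toy hierarchy. It assumes the usual form $1 / (1 + d)$ for the Shortest Path Similarity; the toy graph and the function names are hypothetical and chosen only for this example.

```python
from collections import deque
from typing import Dict, List

# Hypothetical toy hierarchy (synset -> direct hypernyms); not real WordNet data.
TOY_HYPERNYMS: Dict[str, List[str]] = {
    "man": ["adult", "male"],
    "woman": ["adult", "female"],
    "adult": ["person"],
    "male": ["person"],
    "female": ["person"],
    "person": ["entity"],
    "entity": [],
}


def ancestor_distances(synset: str, hypernyms: Dict[str, List[str]]) -> Dict[str, int]:
    """Breadth-first search upwards: shortest distance to every hypernym."""
    dist = {synset: 0}
    queue = deque([synset])
    while queue:
        node = queue.popleft()
        for parent in hypernyms[node]:
            if parent not in dist:
                dist[parent] = dist[node] + 1
                queue.append(parent)
    return dist


def shortest_path_length(a: str, b: str, hypernyms: Dict[str, List[str]]) -> int:
    """Length of the shortest path from a up to a common hypernym and down to b."""
    dist_a = ancestor_distances(a, hypernyms)
    dist_b = ancestor_distances(b, hypernyms)
    return min(dist_a[h] + dist_b[h] for h in dist_a.keys() & dist_b.keys())


def path_similarity(a: str, b: str, hypernyms: Dict[str, List[str]]) -> float:
    """Assumed form 1 / (1 + d), which lies in (0, 1]."""
    return 1.0 / (1.0 + shortest_path_length(a, b, hypernyms))


if __name__ == "__main__":
    print(shortest_path_length("man", "woman", TOY_HYPERNYMS))
    print(path_similarity("man", "woman", TOY_HYPERNYMS))
```

Because every synset in the toy graph reaches the shared root, the set of common hypernyms is never empty, which mirrors the role of the fake root for verbs.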

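Leacock-Chodorow Similarity: The depth of a noun or verb synset $A$, denoted as $\mathrm{depth}(A)$, is defined to be the length of the shortest path from $A$ to the root synset (entity.n.01 for nouns and the fake root for verbs). That is, $\mathrm{depth}(A) = d(A, \mathrm{root})$, where $d$ denotes the shortest path length.

Then, for two synsets $A$ and $B$, the Leacock-Chodorow (LCH) Similarity is defined as $\mathrm{sim}_{LCH}(A, B) = -\log \frac{d(A, B) + 1}{2D}$, where $D$ is the maximum depth over all synsets of the same type.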

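Wu-Palmer Similarity: For two noun or verb synsets $A$ and $B$, their least common subsumer (LCS) is the common hypernym of $A$ and $B$ with the largest depth, where the depth of a synset is the length of its shortest path to the root. We use $\mathrm{LCS}(A, B)$ or simply $\mathrm{LCS}$ to denote the LCS of synsets $A$ and $B$. That is, $\mathrm{LCS}(A, B) = \arg\max_{H} \mathrm{depth}(H)$, taken over all common hypernyms $H$ of $A$ and $B$. Then, the Wu-Palmer (WP) Similarity between synsets $A$ and $B$ is defined as $\mathrm{sim}_{WP}(A, B) = \frac{2\,\mathrm{depth}(\mathrm{LCS})}{\mathrm{depth}(A) + \mathrm{depth}(B)}$.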
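The depth, the least common subsumer and the two depth-based similarities can be sketched on a toy hierarchy as below. The sketch assumes the standard textbook forms of the Leacock-Chodorow and Wu-Palmer similarities; the toy graph, the constant `ROOT` and the function names are hypothetical.

```python
import math
from collections import deque
from typing import Dict, List

# Hypothetical toy hierarchy (synset -> direct hypernyms); not real WordNet data.
TOY_HYPERNYMS: Dict[str, List[str]] = {
    "man": ["adult", "male"],
    "woman": ["adult", "female"],
    "adult": ["person"],
    "male": ["person"],
    "female": ["person"],
    "person": ["entity"],
    "entity": [],
}
ROOT = "entity"


def ancestor_distances(synset: str, hypernyms: Dict[str, List[str]]) -> Dict[str, int]:
    """Shortest upward distance from `synset` to itself and to each hypernym."""
    dist = {synset: 0}
    queue = deque([synset])
    while queue:
        node = queue.popleft()
        for parent in hypernyms[node]:
            if parent not in dist:
                dist[parent] = dist[node] + 1
                queue.append(parent)
    return dist


def depth(synset: str, hypernyms: Dict[str, List[str]]) -> int:
    """Length of the shortest path from `synset` to the root."""
    return ancestor_distances(synset, hypernyms)[ROOT]


def least_common_subsumer(a: str, b: str, hypernyms: Dict[str, List[str]]) -> str:
    """Common hypernym with the largest depth (ties broken alphabetically)."""
    common = ancestor_distances(a, hypernyms).keys() & ancestor_distances(b, hypernyms).keys()
    return max(sorted(common), key=lambda h: depth(h, hypernyms))


def wu_palmer(a: str, b: str, hypernyms: Dict[str, List[str]]) -> float:
    total = depth(a, hypernyms) + depth(b, hypernyms)
    if total == 0:  # both synsets are the root
        return 1.0
    return 2.0 * depth(least_common_subsumer(a, b, hypernyms), hypernyms) / total


def leacock_chodorow(a: str, b: str, hypernyms: Dict[str, List[str]]) -> float:
    dist_a, dist_b = ancestor_distances(a, hypernyms), ancestor_distances(b, hypernyms)
    d = min(dist_a[h] + dist_b[h] for h in dist_a.keys() & dist_b.keys())
    max_depth = max(depth(s, hypernyms) for s in hypernyms)
    return -math.log((d + 1) / (2.0 * max_depth))


if __name__ == "__main__":
    print(least_common_subsumer("man", "woman", TOY_HYPERNYMS))
    print(wu_palmer("man", "woman", TOY_HYPERNYMS))
    print(leacock_chodorow("man", "woman", TOY_HYPERNYMS))
```

Within one synset type the maximum depth is a constant, so the Leacock-Chodorow score is a monotone function of the shortest path length and ranks pairs exactly as the Shortest Path Similarity does.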

3.3 Sense spectrum

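Suppose $v_A$ and $v_B$ are the embedding vectors of synsets $A$ and $B$ respectively. Then, we use the “overlapping” between $v_A$ and $v_B$ to represent the “commonness” between synsets $A$ and $B$. The overlapping of two vectors is measured dimension-wise: Suppose $v_A^i$ and $v_B^i$ are the elements in the $i$'th dimension of $v_A$ and $v_B$ respectively. We use $v_A^i \wedge v_B^i$ to represent the overlapping between $v_A^i$ and $v_B^i$. Then, if $v_A^i$ and $v_B^i$ have the same sign (i.e., both of them are positive or both are negative), $v_A^i \wedge v_B^i$ will equal the one of $v_A^i$ and $v_B^i$ with the smaller absolute value. If $v_A^i$ and $v_B^i$ have different signs, $v_A^i \wedge v_B^i$ will be zero. That is, mathematically:

$$v_A^i \wedge v_B^i = \frac{\mathrm{sgn}(v_A^i) + \mathrm{sgn}(v_B^i)}{2} \cdot \min(|v_A^i|, |v_B^i|) \qquad (2)$$

where sgn is the sign function: $\mathrm{sgn}(x) = 1, 0, -1$ when $x$ is positive, zero and negative respectively.

Taking this operation in each dimension $i$, we can get the overlapping vector by $v_A \wedge v_B = (v_A^1 \wedge v_B^1, \ldots, v_A^n \wedge v_B^n)$, where $n$ is the dimension of the vectors. We also denote $v_A \wedge v_B$ as $v_{AB}$, which can be regarded as the vector representation of the “commonness” between synsets $A$ and $B$. After obtaining $v_{AB}$, the vector representation of the “differences (uniqueness)” of synsets $A$ and $B$ follows directly: We use $v_{A \setminus B} = v_A - v_{AB}$ to represent the “uniqueness” of synset $A$, and use $v_{B \setminus A} = v_B - v_{AB}$ to represent the “uniqueness” of synset $B$.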
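The dimension-wise overlapping described above can be sketched in a few lines. This is an illustrative implementation of the verbal rule (same sign: keep the element with the smaller absolute value; different signs: zero) written with the sign function; the function names and the example vectors are assumptions made for this example.

```python
from typing import List


def sgn(x: float) -> int:
    """Sign function: 1, 0 or -1 for positive, zero or negative x."""
    return (x > 0) - (x < 0)


def overlap(a: List[float], b: List[float]) -> List[float]:
    """Dimension-wise overlapping of two spectra (the commonness spectrum)."""
    return [
        (sgn(x) + sgn(y)) / 2 * min(abs(x), abs(y))
        for x, y in zip(a, b)
    ]


def uniqueness(spectrum: List[float], common: List[float]) -> List[float]:
    """What remains of a spectrum after removing the commonness spectrum."""
    return [x - c for x, c in zip(spectrum, common)]


if __name__ == "__main__":
    v_a = [2.0, -1.0, 0.5]   # hypothetical three-dimensional spectra
    v_b = [1.0, 3.0, 1.5]
    common = overlap(v_a, v_b)
    print(common)
    print(uniqueness(v_a, common))
    print(uniqueness(v_b, common))
```

In this example the second dimension has opposite signs, so it contributes nothing to the commonness and stays unchanged in both uniqueness vectors.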

We can see that each dimension of the two embedding vectors operates independently to form the overlapping vector and the two uniqueness vectors. This makes our embedding vector look like a “spectrum”, whose dimensions are measurements on specific senses. In fact, this is verified by experiments, which will be discussed in Section 5. As a result, we call our synset embedding the sense spectrum. For any two synsets $A$ and $B$, we call the overlapping vector the Commonness Spectrum, and we call the two uniqueness vectors the Uniqueness Spectra. Figure 2 illustrates how to obtain these three spectra from the two initial spectra.

Figure 2: (To be viewed in color) The commonness spectrum $v_{AB}$ as well as the uniqueness spectra $v_{A \setminus B}$ and $v_{B \setminus A}$, based on the initial sense spectra $v_A$ and $v_B$.

In Figure 2, we suppose a spectrum vector has three dimensions. We show $v_A$ in red and $v_B$ in green. According to our overlapping method, dimension 1 of the commonness spectrum $v_{AB}$ is the same as that of $v_A$, which is also in red. Similarly, dimension 3 of $v_{AB}$ is the same as that of $v_B$, which is in green. But $v_A$ and $v_B$ do not overlap in dimension 2, making that of $v_{AB}$ zero. Hence, dimension 2 of the uniqueness spectra $v_{A \setminus B} = v_A - v_{AB}$ and $v_{B \setminus A} = v_B - v_{AB}$ remains the same as in $v_A$ and $v_B$ respectively, since nothing is cancelled by $v_{AB}$. Finally, dimension 1 of $v_{A \setminus B}$ and dimension 3 of $v_{B \setminus A}$ are cancelled to zero. But dimension 3 of $v_{A \setminus B}$ and dimension 1 of $v_{B \setminus A}$ are only partially cancelled, which is shown in blue.

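It is then straightforward to see that the three spectra $v_{AB}$, $v_{A \setminus B}$ and $v_{B \setminus A}$ correspond to the initial HIS scalars $c$, $u_A$ and $u_B$: The commonness spectrum $v_{AB}$ corresponds to the scalar $c$, while the two uniqueness spectra $v_{A \setminus B}$ and $v_{B \setminus A}$ correspond to the scalars $u_A$ and $u_B$, respectively. Hence, our training algorithm is as simple as fitting the norms of the spectra to these scalars:

$$\|v_{AB}\| \rightarrow c, \qquad \|v_{A \setminus B}\| \rightarrow u_A, \qquad \|v_{B \setminus A}\| \rightarrow u_B \qquad (3)$$

where $\|\cdot\|$ is the norm of a vector Cape et al. (2017).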

We will show in the next section that after training, the hypernym-hyponym relationship between two noun or verb synsets $A$ and $B$ is preserved in their corresponding spectra $v_A$ and $v_B$.

4 Evaluation

In this section, we show by experimental results that the hypernym intersection similarity outperforms the three basic similarity measurements in WordNet. And we will show that sense spectra indeed capture the structures of the hypernym-hyponym relationship in WordNet.

4.1 The performance of HIS

To estimate the performance of the HIS measurement, we use the SimLex-999 dataset Hill et al. (2015), which contains 666 noun pairs, 222 verb pairs and 111 adjective pairs. Each pair of words in SimLex-999 is scored from 0 to 10: The higher the score is, the more similar the two words in that pair should be. All the scores are given manually by native English speakers. Table 1 shows a few of the noun and verb pairs in SimLex-999.

Noun pairs        Score    Verb pairs      Score
book   text       6.35     listen hear     8.17
night  day        1.88     go     come     2.42
belief flower     0.40     spend  save     0.55
Table 1: The noun and verb pairs as well as their corresponding scores in SimLex-999

However, there may be more than one synset related to a word Navigli (2009). For example, there are 11 noun synsets related to the word “book”, including book.n.01 (a written work or composition that has been published), book.n.02 (physical objects consisting of a number of pages bound together) and a synset for the sacred writings of the Christian religions. So, we need to first choose the correct synsets for each word pair in SimLex-999.

For a word pair $(w_1, w_2)$, suppose there are $m$ synsets related to $w_1$, and $n$ synsets related to $w_2$. Then, there are $m \times n$ possible combinations of synsets for this word pair. We compute the HIS measurement by formula (1) on each of these synset combinations, and then choose the combination with the maximal HIS score. This combination is regarded as the correct synset choice for the word pair $(w_1, w_2)$, and the corresponding score is regarded as the HIS measurement score of $(w_1, w_2)$, denoted as $\mathrm{HIS}(w_1, w_2)$.
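The selection of the synset combination with the maximal score can be sketched as follows. The function `best_synset_pair` and the toy score table are hypothetical; the similarity is passed in as a callable so that any measurement (HIS or one of the three basic similarities) could be plugged in.

```python
from itertools import product
from typing import Callable, Sequence, Tuple


def best_synset_pair(
    synsets_1: Sequence[str],
    synsets_2: Sequence[str],
    similarity: Callable[[str, str], float],
) -> Tuple[Tuple[str, str], float]:
    """Try all m x n synset combinations and keep the highest-scoring one."""
    best = max(product(synsets_1, synsets_2), key=lambda pair: similarity(*pair))
    return best, similarity(*best)


if __name__ == "__main__":
    # Hypothetical scores for illustration only; not produced by any real measurement.
    toy_scores = {
        ("book.n.01", "text.n.01"): 0.8,
        ("book.n.01", "text.n.02"): 0.3,
        ("book.n.02", "text.n.01"): 0.5,
        ("book.n.02", "text.n.02"): 0.1,
    }
    pair, score = best_synset_pair(
        ["book.n.01", "book.n.02"],
        ["text.n.01", "text.n.02"],
        lambda a, b: toy_scores[(a, b)],
    )
    print(pair, score)
```

If either word has no related synset, `max` receives an empty sequence and raises a `ValueError`, so such word pairs would need to be filtered out beforehand.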

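Finally, suppose $P$ is a specific set of word pairs in SimLex-999 (say, all the noun pairs). For each word pair $p \in P$, suppose $x_p$ is the manually given similarity score in SimLex-999, and $y_p$ is the similarity score under the HIS measurement. Then, we compute the Spearman’s correlation McDonald (2009) between $x_p$ and $y_p$ as:

$$\rho = \frac{\sum_{p \in P} (x_p - \bar{x})(y_p - \bar{y})}{\sqrt{\sum_{p \in P} (x_p - \bar{x})^2} \, \sqrt{\sum_{p \in P} (y_p - \bar{y})^2}} \qquad (4)$$

where $\bar{x}$ and $\bar{y}$ are the averages of $x_p$ and $y_p$ respectively and, as Spearman’s correlation requires, the scores are replaced by their ranks within $P$ before the formula is applied. This Spearman’s correlation is then our estimate of the performance of the HIS measurement. A higher Spearman’s correlation here means that the model handles the semantic meanings of words more like humans do de Winter et al. (2016).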
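Since Spearman’s correlation is the Pearson correlation of the ranks, it can be sketched with the standard library alone. The code below is a minimal illustration, not the evaluation script used for our results; the function names are assumptions, ties receive their average rank, and the model scores in the usage example are hypothetical.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1 upwards; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Centered, normalized cross product of two equally long sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's correlation: Pearson correlation applied to the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    human = [6.35, 1.88, 0.40]   # noun-pair scores from Table 1
    model = [0.9, 0.5, 0.1]      # hypothetical model scores
    print(spearman(human, model))
```

Only the ordering of the scores matters here, so any monotone rescaling of a similarity measurement leaves its Spearman’s correlation unchanged.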

In order to obtain comparisons, we apply the same process to the Shortest Path Similarity, the Leacock-Chodorow (LCH) Similarity and the Wu-Palmer (WP) Similarity. That is, we replace the HIS score by each of the three similarity scores described in Section 3.2 to get the corresponding Spearman’s correlations. Moreover, we work on three different sets of word pairs in SimLex-999: only noun pairs, only verb pairs, or both noun and verb pairs combined. Results are shown in Table 2.

Model \ Group    Noun pairs    Verb pairs    Both
HIS
Shortest Path    58.38         39.20         51.96
LCH              58.38         39.20         54.92
WP               55.00         37.84         48.82
Table 2: The Spearman’s correlations obtained by performing each similarity measurement on different sets of word pairs in SimLex-999.

We can see that the HIS measurement achieves the highest Spearman’s correlation on all the three sets of word pairs. To be specific, on the verb pairs, the HIS measurement outperforms the other three similarity measurements by 10 percent, which is a significant improvement.

Therefore, we claim that the HIS measurement captures the hypernym-hyponym relationship in WordNet better than the three basic similarities do. Hence, it is meaningful to use the initial HIS scalars (the sizes of the commonness set and of the two uniqueness sets) as labels to train our sense spectra, whose performance is given in the following subsection.

4.2 The performance of sense spectra

Again, we note that it is meaningless to compare a noun synset with a verb one. So, the noun and verb spectra are generated and trained independently: There are 82,115 noun synsets and 13,767 verb synsets, and one spectrum is generated for each of them. We always use the same spectrum dimension for both noun and verb synsets.

When training the sense spectra, we use TensorFlow in Python Abadi et al. (2016). We first introduce our implementation methods. Then, we introduce our specific strategy for building a training batch. Finally, testing results are given.

4.2.1 Implementation issues

When computing the dimension-wise overlapping, we realize that it is difficult to perform formula (2) directly in TensorFlow. This is because errors cannot pass through the sign function by back propagation Rumelhart et al. (1986). Besides, there is no need to generate the overlapping dimension by dimension in practice. So, we use a formula involving the rectifier function Agarap (2018) to compute the commonness spectrum $v_{AB}$ of two spectra $v_A$ and $v_B$ directly:

$$v_{AB} = \min(\mathrm{relu}(v_A), \mathrm{relu}(v_B)) - \min(\mathrm{relu}(-v_A), \mathrm{relu}(-v_B))$$

To be specific, suppose the dimension of a spectrum vector is $n$. We first concatenate the rectified vectors $\mathrm{relu}(v_A)$ and $\mathrm{relu}(v_B)$ into a tensor (matrix) of shape $2 \times n$. After that, we take the minimum value in each dimension of this matrix, which returns an $n$ dimensional vector $v^{+}$. Similarly, we can get the vector $v^{-}$ with respect to $\mathrm{relu}(-v_A)$ and $\mathrm{relu}(-v_B)$. Finally, we can get $v_{AB}$ via $v_{AB} = v^{+} - v^{-}$.
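The rectifier-based computation can be checked against the sign-based rule with a small standard-library sketch. This is not the TensorFlow code itself; the helper names (`relu`, `elementwise_min`, `overlap_relu`) and the example vectors are assumptions made for this illustration.

```python
from typing import List


def relu(v: List[float]) -> List[float]:
    """Rectifier applied to each element."""
    return [max(x, 0.0) for x in v]


def elementwise_min(a: List[float], b: List[float]) -> List[float]:
    """Minimum in each dimension of the two stacked vectors."""
    return [min(x, y) for x, y in zip(a, b)]


def overlap_relu(a: List[float], b: List[float]) -> List[float]:
    """Commonness spectrum from rectified positive and negative parts."""
    positive = elementwise_min(relu(a), relu(b))
    negative = elementwise_min(relu([-x for x in a]), relu([-y for y in b]))
    return [p - q for p, q in zip(positive, negative)]


def overlap_sign(a: List[float], b: List[float]) -> List[float]:
    """Reference: same sign keeps the smaller magnitude, otherwise zero."""
    return [
        (x if abs(x) < abs(y) else y) if x * y > 0 else 0.0
        for x, y in zip(a, b)
    ]


if __name__ == "__main__":
    v_a = [2.0, -1.0, 0.5, -3.0]   # hypothetical spectra
    v_b = [1.0, 3.0, 1.5, -2.0]
    print(overlap_relu(v_a, v_b))
    print(overlap_sign(v_a, v_b))
```

Both functions return the same vector on this input, while the rectifier version uses only operations through which gradients can flow almost everywhere.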

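In contrast, the formulas to obtain the two uniqueness spectra in practice are much more straightforward: $v_{A \setminus B} = v_A - v_{AB}$ and $v_{B \setminus A} = v_B - v_{AB}$, where $v_{AB}$ is the commonness spectrum of the spectra $v_A$ and $v_B$.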

After that, we compute the norm of a dimensional vector as

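Finally, suppose $\hat{c} = \|v_{AB}\|$, $\hat{u}_A = \|v_{A \setminus B}\|$ and $\hat{u}_B = \|v_{B \setminus A}\|$ are the norms of the commonness spectrum and of the two uniqueness spectra. Applying the initial HIS scalars $c$, $u_A$, $u_B$ as labels, we complete the training by minimizing the error between $(\hat{c}, \hat{u}_A, \hat{u}_B)$ and $(c, u_A, u_B)$ via AdamOptimizer Kingma and Ba (2014).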
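The quantity being minimized can be illustrated as below. Two choices in this sketch are assumptions made purely for illustration: the norm is taken to be the sum of absolute values, and the error is taken to be a sum of squared differences. The function names and the label values are hypothetical as well, and no gradient step is shown.

```python
from typing import List, Sequence, Tuple


def overlap(a: List[float], b: List[float]) -> List[float]:
    """Commonness spectrum: same sign keeps the smaller magnitude, else zero."""
    return [
        (x if abs(x) < abs(y) else y) if x * y > 0 else 0.0
        for x, y in zip(a, b)
    ]


def l1_norm(v: Sequence[float]) -> float:
    """ASSUMPTION: sum of absolute values is used as the norm."""
    return sum(abs(x) for x in v)


def spectrum_scalars(a: List[float], b: List[float]) -> Tuple[float, float, float]:
    """Norms of the commonness spectrum and of the two uniqueness spectra."""
    common = overlap(a, b)
    unique_a = [x - c for x, c in zip(a, common)]
    unique_b = [y - c for y, c in zip(b, common)]
    return l1_norm(common), l1_norm(unique_a), l1_norm(unique_b)


def squared_error(predicted: Sequence[float], labels: Sequence[float]) -> float:
    """ASSUMPTION: the training error is a sum of squared differences."""
    return sum((p - t) ** 2 for p, t in zip(predicted, labels))


if __name__ == "__main__":
    v_a = [2.0, -1.0, 0.5]       # hypothetical spectra
    v_b = [1.0, 3.0, 1.5]
    labels = (3.0, 2.0, 2.0)     # hypothetical HIS scalars (c, u_A, u_B)
    predicted = spectrum_scalars(v_a, v_b)
    print(predicted, squared_error(predicted, labels))
```

In an actual training loop this error would be computed on tensors and passed to the optimizer, so that the spectra move until their norms match the labels.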

4.2.2 Batch formation strategies

Each training example in a batch is a synset pair $(A, B)$, which is formed in one of three different ways:

The direct hypernym pair: After choosing a synset $A$ randomly, we pick its direct hypernym $B$ to form the pair $(A, B)$. If there is more than one direct hypernym for the synset $A$, we choose one of them randomly.

The semantic sense related pair: As we mentioned in the introduction, each synset is related to a specific semantic sense. There are 67,176 noun semantic senses and 7,440 verb semantic senses that have more than one related synset. In the training, we randomly pick one semantic sense and randomly choose two of its related synsets to form a semantic sense related pair. That is, suppose we get a semantic sense and its related synsets. Then, we randomly pick two synsets $A$ and $B$ from these related synsets to form the pair $(A, B)$.

Random pair: We choose two synsets $A$ and $B$ randomly to form the pair $(A, B)$.

We have $N$ pairs built in each of these three ways. So, our total batch size is $3N$. The value of $N$ is kept fixed throughout our training. And again, we note that noun and verb synset pairs are formed independently.
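The three ways of forming pairs can be sketched as a single batch builder. The toy hierarchy, the sense groups and the function name `build_batch` are hypothetical and heavily simplified; the sketch only illustrates the sampling logic, not our actual data pipeline.

```python
import random
from typing import Dict, List, Tuple

# Hypothetical, simplified data for illustration; not real WordNet content.
TOY_HYPERNYMS: Dict[str, List[str]] = {
    "bank.n.01": ["slope.n.01"],
    "bank.n.02": ["institution.n.01"],
    "slope.n.01": ["entity.n.01"],
    "institution.n.01": ["entity.n.01"],
    "entity.n.01": [],
}
TOY_SENSES: Dict[str, List[str]] = {
    "bank": ["bank.n.01", "bank.n.02"],
    "slope": ["slope.n.01"],
}


def build_batch(
    hypernyms: Dict[str, List[str]],
    sense_groups: Dict[str, List[str]],
    n_per_kind: int,
    rng: random.Random,
) -> List[Tuple[str, str]]:
    """Return n_per_kind pairs of each kind, 3 * n_per_kind pairs in total."""
    synsets = sorted(hypernyms)
    with_parent = [s for s in synsets if hypernyms[s]]
    multi_sense = [g for g in sense_groups.values() if len(g) > 1]
    batch: List[Tuple[str, str]] = []
    for _ in range(n_per_kind):          # direct hypernym pairs
        child = rng.choice(with_parent)
        batch.append((child, rng.choice(hypernyms[child])))
    for _ in range(n_per_kind):          # semantic sense related pairs
        first, second = rng.sample(rng.choice(multi_sense), 2)
        batch.append((first, second))
    for _ in range(n_per_kind):          # random pairs
        batch.append((rng.choice(synsets), rng.choice(synsets)))
    return batch


if __name__ == "__main__":
    for pair in build_batch(TOY_HYPERNYMS, TOY_SENSES, 2, random.Random(0)):
        print(pair)
```

In this sketch a random pair may contain the same synset twice, since the two synsets are drawn independently.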

4.2.3 Testing results

After training, we look up the three closest spectra for each spectrum under the HIS measurement. That is, for the spectrum $v_A$ of a synset $A$, we compute the spectrum-based scalars (the norms of the commonness spectrum and of the two uniqueness spectra) with respect to every other spectrum $v_B$. Then, we apply formula (1) to each set of scalars to find the three synsets that provide the maximal HIS values. Again, this procedure is performed on the noun and the verb spectra independently. Some results are shown in Table 3.
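The lookup of the closest spectra can be sketched as a top-k search with a pluggable scoring function. In the sketch below the score is a stand-in (the summed magnitude of the overlapping part only), which is an assumption made for illustration and is not formula (1); the spectra and the function names are hypothetical.

```python
import heapq
from typing import Callable, Dict, List, Sequence


def commonness_norm(a: Sequence[float], b: Sequence[float]) -> float:
    """Stand-in score: total magnitude shared in dimensions with equal sign."""
    return sum(min(abs(x), abs(y)) for x, y in zip(a, b) if x * y > 0)


def closest_synsets(
    query: str,
    spectra: Dict[str, List[float]],
    score: Callable[[Sequence[float], Sequence[float]], float],
    k: int = 3,
) -> List[str]:
    """Names of the k synsets whose spectra score highest against the query."""
    others = [name for name in spectra if name != query]
    return heapq.nlargest(
        k, others, key=lambda name: score(spectra[query], spectra[name])
    )


if __name__ == "__main__":
    toy_spectra = {                    # hypothetical two-dimensional spectra
        "a": [1.0, 1.0],
        "b": [1.0, 0.9],
        "c": [0.5, 0.0],
        "d": [-1.0, -1.0],
        "e": [0.2, 0.2],
    }
    print(closest_synsets("a", toy_spectra, commonness_norm))
```

Replacing `commonness_norm` by a function that evaluates the HIS formula on the three spectrum-based scalars would give the lookup described in the text.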

Synset Synset Synset Synset
8.04 8.03 7.59
8.98 8.90 8.88
7.06 7.02 7.01
6.80 6.32 6.29
4.99 4.87 4.55
6.22 6.21 6.20
1.91 1.87 0.93
0.98 0.97 0.97
4.00 3.93 3.92
2.99 2.93 2.93
Table 3: The three closest spectra for each sense spectrum under the HIS measurement.

Table 3 contains the synsets related to both commonly used words and specific terminologies. To be specific, synsets , and represents three different genera of plants, to which belongs; The synset is defined as “a torn part of a ticket returned to the holder as a receipt”; And the synset means “laugh unrestrainedly”. By these examples, we can see that sense spectra with similar meanings (corresponding to their synsets) are clustered together under the HIS measurement, which is similar to the performance of word2vec Mikolov et al. (2013). This result shows that the hypernym-hyponym relationship in WordNet can be explicitly measured on sense spectra via the HIS measurement.

However, one may ask: To what extent, or how precisely, can sense spectra preserve the hypernym-hyponym relationship? In order to answer this question, we design the following operation: Given a pair of synsets and their corresponding spectra, we compare the initial HIS scalars with the spectrum-based HIS scalars by

(5)

The numerator of formula (5) represents the error made by the sense spectra, while the denominator represents the magnitude of the initial HIS scalars. Hence, the smaller the ratio given by formula (5) is, the less important the error is compared to the initial HIS scalars, and hence the more precisely sense spectra capture the hypernym-hyponym relationship between their corresponding synsets.

Then, we evaluate formula (5) on the SimLex-999 dataset. That is, we obtain the correct synset pair (the pair with the maximum HIS measurement score) for each noun and verb pair in SimLex-999. After that, we compute the ratio of formula (5) for each pair. The distribution of this ratio is described by the histogram in Figure 3, and some statistical results are given in Table 4.

Figure 3: The histogram distribution of the values .
Range of
Percentage of Pairs
Table 4: The statistical results of the values .

We can see from Figure 3 and Table 4 that most of the synset pairs have a ratio significantly less than one. That is, in most SimLex-999 synset pairs, the sense spectra can capture the hypernym-hyponym relationship between the two synsets precisely. Relying on the authority of the SimLex-999 dataset, we claim that in general, our sense spectra preserve the hypernym-hyponym relationships among their corresponding synsets precisely.

That is, given any two spectra, we can precisely recover the hypernym-hyponym relationship between their corresponding synsets. Hence, if we work on each and every pair of sense spectra, we can recover the hypernym-hyponym tree in WordNet. This indicates that the hypernym-hyponym tree structure in WordNet is completely preserved in sense spectra. To the best of our knowledge, this is the first time that low dimensional embeddings can explicitly and completely preserve the semantic relationships in WordNet.

Finally, in Figure 4, we plot the spectra for the synsets in row 1 and row 7 of Table 3. Different from the vertical spectra in Figure 2, we plot horizontal spectra here to save space.

Figure 4: Spectra for the synsets in row 1 and row 7 of Table 3.

We can see from Figure 4 that verb spectra are in general sparser than the noun ones. This is because the hypernym-hyponym relationships among verb synsets are more concise compared to those among noun synsets. Hence, fewer dimensions in a spectrum are enough to preserve the semantic information. In addition, spectra with similar meanings (corresponding to their synsets) also have similar distributions across dimensions. That is, in most dimensions, spectra with similar meanings tend to have the same sign and the same magnitude. These phenomena are in fact meaningful, which will be discussed in the next section.

Our code can be accessed via https://github.com/canlinzhang/Sense-Spectrum.

5 Discussions

In this section, we further discuss the meaning behind our experimental results, based on which we describe the potential applications of the sense spectra.

5.1 Building hierarchical language model based on sense spectra

As we mentioned in the previous section, sense spectra with similar meanings tend to have the same sign and magnitude in most dimensions. However, since the number of dimensions in a spectrum is far less than the number of (noun or verb) synsets, it is impossible for each dimension to preserve semantic information independently. As a result, we can conclude that specific combinations of dimensions in a sense spectrum work together to preserve specific semantic information. That is, there exist structures related to semantic senses among the dimensions in sense spectra, which indicates that the name sense spectrum is appropriate.

Then, combined with a text training corpus Liu and Curran (2006), it is possible to build a hierarchical language model based on the structures in sense spectra. For example, we may first group together the dimensions that co-occur frequently across noun or verb spectra. These dimension groups should be highly related to the dimension combinations preserving specific semantic information. Then, for each word in the training corpus, we may find all the related synsets and put the corresponding spectra into a list. After that, we will have a list of the possible spectra for each word in the training corpus. Finally, based on this, we may discover the “groups of dimension groups.” That is, we further group together the dimension groups that co-occur frequently in the training corpus. In this way, a hierarchical language model with explicit upper layer units can be built.

5.2 Combining sense spectra with word embeddings

Since sense spectra are low dimensional and dense, we can directly concatenate them to pre-trained word embeddings. Similar to the above discussion, for each word in the training corpus, we may find its related synsets. Then, we pick out the corresponding spectra of these synsets and average them to get a single summary vector. Finally, we concatenate this summary vector to the pre-trained embedding vector of the word.
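The averaging and concatenation step can be sketched as follows. The function names and the example vectors are hypothetical, and the choice of a zero vector for words without any related synset is an assumption added so that all augmented embeddings keep the same length.

```python
from typing import List, Sequence


def mean_spectrum(spectra: Sequence[Sequence[float]], dim: int) -> List[float]:
    """Average the spectra of all synsets related to one word."""
    if not spectra:
        # ASSUMPTION: words without related synsets get a zero summary vector.
        return [0.0] * dim
    return [sum(column) / len(spectra) for column in zip(*spectra)]


def augment_embedding(
    word_vector: Sequence[float],
    spectra: Sequence[Sequence[float]],
    dim: int,
) -> List[float]:
    """Concatenate the averaged spectrum to a pre-trained word embedding."""
    return list(word_vector) + mean_spectrum(spectra, dim)


if __name__ == "__main__":
    word_vector = [0.1, 0.2]                 # hypothetical pre-trained embedding
    spectra = [[1.0, -1.0], [3.0, 1.0]]      # hypothetical spectra of two synsets
    print(augment_embedding(word_vector, spectra, dim=2))
```

A weighted average (for instance by sense frequency) would be a natural variation, but it is not something the text above specifies.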

In this way, the embedding vectors contain not only the contextual information obtained from the corpus-based training, but also the semantic relationship information obtained from the knowledge-based training De Boom et al. (2016). We believe that such word embeddings are promising for tasks such as Word Sense Disambiguation (WSD) Yarowsky (2000) and Outlier Detection Aggarwal (2016), where information about semantic relationships is in high demand. This is being investigated.

6 Conclusion

We provide sense spectra, which are the first dense embedding system that can explicitly and completely preserve the hypernym-hyponym tree structure for noun and verb synsets in WordNet. The explicit measurement on sense spectra is the hypernym intersection similarity (HIS), which is a similarity measurement describing the “commonness” and “uniqueness” between two noun or verb synsets in WordNet.

Results show that the HIS measurement outperforms the three basic similarity measurements in WordNet on the SimLex-999 noun and verb pairs. Moreover, we indicate by experiments that the hypernym-hyponym tree structure can be recovered from the sense spectra. Novel applications built on sense spectra are described and are being actively explored.

References

  • Abadi et al. (2016) M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, et al. 2016. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint, arXiv:1603.04467.
  • Agarap (2018) A. F. Agarap. 2018. Deep learning using rectified linear units (relu). Computing Research Repository (CoRR).
  • Aggarwal (2016) C. C. Aggarwal. 2016. Outlier Analysis. Second Edition. Springer, New York.
  • Bengio et al. (2003) Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin. 2003. A neural probabilistic language model. Journal of Machine Learning Research (JMLR), 3:1137–1155.
  • Boyd-Graber et al. (2006) J. Boyd-Graber, C. Fellbaum, D. Osherson, and R. Schapire. 2006. Adding dense, weighted connections to wordnet. In Proceedings of the Third Global WordNet Meeting.
  • Cape et al. (2017) J. Cape, M. Tang, and C. E. Priebe. 2017. The two-to-infinity norm and singular subspace geometry with applications to high-dimensional statistics. arXiv preprint, arxiv.org/pdf/1507.01127.
  • De Boom et al. (2016) C. De Boom, S. Van Canneyt, T. Demeester, and B. Dhoedt. 2016. Representation learning for very short texts using weighted word embedding aggregation. Pattern Recognition Letters, pages 150–156.
  • Devlin et al. (2018) J. Devlin, M. Chang, K. Lee, and K. Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint, arXiv:1810.04805.
  • Esbensen and Geladi (1987) K. Esbensen and P. Geladi. 1987. Principal component analysis. Chemometrics and Intelligent Laboratory Systems, 2:37–52.
  • Gao and Xu (2013) C. Gao and B. Xu. 2013. The application of semantic field theory to english vocabulary learning. Theory and Practice in Language Studies, 3(11).
  • Hill et al. (2015) F. Hill, R. Reichart, and A. Korhonen. 2015. Simlex-999: Evaluating semantic models with (genuine) similarity estimation. Computational Linguistics, volume 41.
  • Jones (1979) D. S. Jones. 1979. Elementary Information Theory. Clarendon Press, Oxford, UK.
  • Kingma and Ba (2014) D. P. Kingma and J. L. Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint, arXiv:1412.6980.
  • Liu and Curran (2006) V. Liu and J. R. Curran. 2006. Web text corpus for natural language processing. European Chapter of the Association for Computational Linguistics (ACL-EACL).
  • McDonald (2009) J. H. McDonald. 2009. Handbook of biological statistics. Sparky House, Baltimore.
  • Mikolov et al. (2013) T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119.
  • Miller (1995) G. A. Miller. 1995. Wordnet: A lexical database for english. Communications of the ACM, 38(11):39–41.
  • Miller et al. (1990) G. A. Miller, R. Beckwith, C. D. Fellbaum, D. Gross, and K. Miller. 1990. Wordnet: An online lexical database. International Journal of Lexicography, pages 235–244.
  • Miller and Hristea (2006) G. A. Miller and F. Hristea. 2006. Wordnet nouns: Classes and instances. Computational Linguistics, 32(1):1–3.
  • Navigli (2009) R. Navigli. 2009. Word sense disambiguation: A survey. ACM computing surveys (CSUR), 41(2):10.
  • Rojas (2015) R. Rojas. 2015. The curse of dimensionality. arxiv.org/pdf/1507.01127.
  • Rothe and Schutze (2015) S. Rothe and H. Schutze. 2015. Autoextend: Extending word embeddings to embeddings for synsets and lexemes. In Proceedings of the 53rd ACL, volume 1:1793–1803.
  • Rumelhart et al. (1986) D. E. Rumelhart, G. E. Hinton, and R. J. Williams. 1986. Learning internal representations by back-propagating errors. Nature, 323:533–536.
  • Saedi et al. (2018) C. Saedi, A. Branco, J. A. Rodrigues, and J. Silva. 2018. Wordnet embeddings. In Proceedings of The Third Workshop on Representation Learning for NLP at Association for Computational Linguistics, pages 122–131.
  • Slimani (2013) T. Slimani. 2013. Description and evaluation of semantic similarity measures approaches. Journal of Computer Applications, volume 80(10):pages 1–10.
  • de Winter et al. (2016) J. C. de Winter, S. D. Gosling, and J. Potter. 2016. Comparing the pearson and spearman correlation coefficients across distributions and sample sizes: A tutorial using simulations and empirical data. Psychological Methods, 21(3):273–290.
  • Yamada et al. (2009) I. Yamada, K. Torisawa, J. Kazama, K. Kuroda, M. Murata, S. D. Saeger, F. Bond, and A. Sumida. 2009. Hypernym discovery based on distributional similarity and hierarchical structures. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 929–937.
  • Yarowsky (2000) D. Yarowsky. 2000. Hierarchical decision lists for word sense disambiguation. Computers and the Humanities, 34(1):179–186.