Unsupervised Word Polysemy Quantification with Multiresolution Grids of Contextual Embeddings

03/23/2020 ∙ by Christos Xypolopoulos, et al. ∙ 0

The number of senses of a given word, or polysemy, is a very subjective notion, which varies widely across annotators and resources. We propose a novel method to estimate polysemy, based on simple geometry in the contextual embedding space. Our approach is fully unsupervised and purely data-driven. We show through rigorous experiments that our rankings are well correlated (with strong statistical significance) with 6 different rankings derived from famous human-constructed resources such as WordNet, OntoNotes, Oxford, Wikipedia etc., for 6 different standard metrics. We also visualize and analyze the correlation between the human rankings. A valuable by-product of our method is the ability to sample, at no extra cost, sentences containing different senses of a given word. Finally, the fully unsupervised nature of our method makes it applicable to any language. Code and data are publicly available at https://github.com/ksipos/polysemy-assessment.




1 Introduction

Polysemy, the number of senses that a word has, is a very subjective notion, subject to individual biases. Word sense annotation has always been one of the tasks with the lowest values of inter-annotator agreement Artstein and Poesio (2008). Yet, creating high-quality, consistent word sense inventories is a critical pre-requisite to successful word sense disambiguation.

Towards creating word sense inventories, it can be helpful to have some reliable information about word polysemy. That is, knowing which words have many senses, and which words have only a few senses. Such information can help in creating new inventories, but also in validating and interpreting existing ones. It can also help in selecting which words to include in a study (e.g., only highly polysemous words).

We propose a novel, fully unsupervised and data-driven approach to quantify word polysemy, based on basic geometry in the contextual embedding space.

Contextual word embeddings have emerged in the last few years, as part of the NLP transfer learning revolution. Now, entire deep models are pre-trained on huge amounts of unannotated data and fine-tuned on much smaller annotated datasets. Some of the most famous examples include ULMFiT Howard and Ruder (2018) and ELMo Peters et al. (2018), both based on recurrent neural networks; and GPT Radford et al. (2018) and BERT Devlin et al. (2018), based on transformers Vaswani et al. (2017). These models are all very deep language models. During pre-training on large-scale corpora, they learn to generate powerful internal representations, including fine-grained contextual word embeddings. For instance, in a well pre-trained model, the word python will have two very different embeddings depending on whether it occurs in a programming context (as in, e.g., “python is my favorite language”) or in an ecological context (“while hiking in the rainforest, I saw a python”).

Our approach capitalizes on the contextual embeddings previously described. It does not involve any tool and does not rely on any human input or judgment. Also, thanks to its unsupervised nature, it can be applied to any language, provided that contextual embeddings are available.

The remainder of this paper is organized as follows. We detail our approach in section 2. Then, we present our experimental setup (sec. 3) and evaluation metrics (sec. 4), and report and interpret our results (sec. 5). In section 6, we present an interesting by-product of our method, which allows the user to sample sentences that each contain a different sense of a given word. Finally, related work is presented in section 7.

2 Proposed approach

2.1 Basic assumption

First, by passing diverse sentences containing a given word to a pre-trained language model, we construct a representative set of vectors for that word (one vector for each occurrence of the word). The basic and intuitive assumption we make is that the volume covered by the cloud of points in the contextual embedding space is representative of the polysemy of the associated word.

2.2 Main idea: multiresolution grids

As a proxy for the volume covered, we adopt a simple geometrical approach. As shown in Fig. 1, we construct a hierarchical discretization of the space, where, at each level, the same number of bins are drawn along each dimension. Each level corresponds to a different resolution. Our polysemy score is based on the proportion of bins covered by the vectors of a given word, at each level.

This simple binning strategy makes more sense than clustering-based approaches. Indeed, clusters do not partition the space equally and regularly. This is especially problematic, since word representations are not uniformly distributed in the embedding space Ethayarajh (2019). Indeed, in that case, the vectors lying in the same dense area of the space will always belong to one single large cluster, while outliers lying in the same, but sparser, area of the space, will be assigned to many different small clusters. Therefore, counting the number of clusters a given word belongs to is not a reliable indicator of how much of the space this word covers.

2.3 Scoring scheme

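We quantify the polysemy degree of a word $w$ as:

$$\mathrm{score}(w) = \sum_{l=1}^{L} \frac{\mathrm{cover}_w^l}{2^{L-l}} \tag{1}$$

where $\mathrm{cover}_w^l$ designates the proportion of bins covered by word $w$ at level $l$, between 0 and 1, and $L$ is the number of levels in the hierarchy. At each level $l$, $2^l$ bins are drawn along each dimension (see the vertical and horizontal lines in Fig. 1). The hierarchy starts at $l=1$ since there is only one bin covering all the space at $l=0$ (so all words have equal coverage at this level). The total number of bins in the entire space, at a given level $l$, is equal to $2^{lD}$, where $D$ is the dimensionality of the space.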

Consider again the example of Fig. 1. In this example, each word is associated with a set of 10 contextualized embeddings in a space of dimension $D=2$, and the hierarchy has $L=3$ levels. First, we can clearly see that word 1 (blue circles) covers a large area of the space while all the vectors of word 2 (orange squares) are grouped in the same region. Intuitively, this can be interpreted as “word 1 occurs in more different contexts than word 2”, which per our assumption, is equivalent to saying that “word 1 is more polysemous than word 2”.

Let us now see how this is reflected by our scoring scheme. First, with $L=3$, the penalization terms (denominators $2^{L-l}$) for levels 1 to 3 are $[4, 2, 1]$. Note that the higher the level, the exponentially more bins there are, and so the less penalized (or the more rewarded) coverage is, because getting good coverage becomes more and more difficult. Now, per Eq. 1, the score of each word is computed as the dot product of its coverage vector (coverage at each level) with the vector of inverse penalization terms, $[1/4, 1/2, 1]$. Word 1 obtains a higher score than word 2. We can thus see that our scores reflect what can be observed in Fig. 1: word 1 covers a larger area of the space than word 2.

Note that the score of a given word is only meaningful in comparison with the scores of other words, i.e., in rankings, as will be seen in the next section.
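As a rough illustration of the scoring scheme, the sketch below computes bin coverage at each level and combines the coverages with denominators $2^{L-l}$. It is not the GraKeL-based implementation used in this study: the function names, the explicit bounding box (`lo`, `hi`), and the toy 2-dimensional vectors are assumptions made for the example.

```python
def bin_index(vector, level, lo, hi):
    """Index of the bin containing `vector` when 2**level bins are drawn per dimension."""
    bins_per_dim = 2 ** level
    index = []
    for x, a, b in zip(vector, lo, hi):
        k = int((x - a) / (b - a) * bins_per_dim)
        # Clamp so that points on (or outside) the box boundary fall in a valid bin.
        index.append(max(0, min(k, bins_per_dim - 1)))
    return tuple(index)


def coverage(vectors, level, lo, hi):
    """Proportion of the 2**(level * D) bins occupied by at least one vector."""
    dim = len(vectors[0])
    occupied = {bin_index(v, level, lo, hi) for v in vectors}
    return len(occupied) / (2 ** level) ** dim


def polysemy_score(vectors, n_levels, lo, hi):
    """Sum of per-level coverages, each divided by 2**(n_levels - level)."""
    return sum(
        coverage(vectors, level, lo, hi) / 2 ** (n_levels - level)
        for level in range(1, n_levels + 1)
    )


# Toy data (hypothetical): one word spread over the unit square, one concentrated word.
spread = [(0.1, 0.1), (0.9, 0.1), (0.1, 0.9), (0.9, 0.9)]
tight = [(0.10, 0.10), (0.11, 0.10), (0.10, 0.12), (0.12, 0.11)]
box_lo, box_hi = (0.0, 0.0), (1.0, 1.0)
print(polysemy_score(spread, 3, box_lo, box_hi))  # larger score
print(polysemy_score(tight, 3, box_lo, box_hi))   # smaller score
```

In practice, the bounding box would be computed once over the vectors of all words, so that every word is discretized on the same grid.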

Implementation. To compute our scores, we built on the code of the pyramid match kernel from the GraKeL Python library Siglidis et al. (2018).

3 Experiments

In this section, we describe the protocol we followed to test the extent to which our rankings match human rankings.

3.1 Word selection

The first step was to select words to include in our analysis. To this purpose, we downloaded and extracted all the text from the latest available English Wikipedia dump (https://dumps.wikimedia.org/). We then performed tokenization, removed stopwords, punctuation, and numbers, and counted the occurrences of each token of at least 3 characters. Out of these tokens, we kept the 2000 most frequent.
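The selection step above can be sketched as follows. This is only an illustration: the regular-expression tokenizer (which keeps lowercase ASCII letter runs and thereby drops punctuation and numbers), the tiny stopword set, and the function name are assumptions, not the tools used in the study.

```python
import re
from collections import Counter


def most_frequent_tokens(texts, stopwords, k, min_length=3):
    """Return the k most frequent tokens of at least `min_length` characters."""
    counts = Counter()
    for text in texts:
        # Keeping only letter runs removes punctuation and numbers in one pass.
        for token in re.findall(r"[a-z]+", text.lower()):
            if len(token) >= min_length and token not in stopwords:
                counts[token] += 1
    return counts.most_common(k)


# Hypothetical miniature corpus and stopword list.
corpus = ["The bank of the river.", "A bank holds 100 deposits; the bank lends."]
print(most_frequent_tokens(corpus, stopwords={"the", "of", "a"}, k=1))
```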

3.2 Generating vector sets

For each word in the shortlist, we randomly selected 3000 sentences such that the corresponding word appeared exactly once within each sentence. The words that did not appear in at least 3000 sentences were removed from the analysis, reducing the size of the shortlist from 2000 to 1822. Then, for each word, the associated sentences were passed through a pre-trained ELMo model Peters et al. (2018) in test mode (we used the implementation and pre-trained weights publicly released by the authors, https://allennlp.org/elmo), and the top layer representations corresponding to the word were harvested. The advantage of using ELMo’s top layer embeddings is that they are the most contextual, as shown by Ethayarajh (2019). We ended up with a set of exactly 3000 1024-dimensional contextual embeddings for each word.

3.3 Dimensionality reduction

Remember that the total number of bins in the entire space is equal to $2^{lD}$ at a given level $l$, which would have given us an intractably large number of bins ($2^{1024}$) even at the first level, since the ELMo representations have dimensionality $D=1024$. To reduce the dimensionality of the contextual embedding space, we applied PCA, trying 19 different output dimensionalities, from $D=2$ to $D=20$ with steps of 1. Due to the quantity and high initial dimensionality of the vectors, we used the distributed version of PCA provided by PySpark’s ML library Meng et al. (2016), with 15 executors with 10 GB of RAM each.

3.4 Score computation

For each PCA output dimensionality, we computed our scores with 18 different hierarchies whose numbers of levels ranged from 2 to 19. So in total, we obtained $19 \times 18 = 342$ rankings.

3.5 Ground truth rankings and baselines

We evaluated the rankings generated by our approach against several ground truth rankings that we derived from human-constructed resources.

Since the number of senses of a word is a subjective, debatable notion, and thus may vary from source to source, we included 6 ground truth rankings in our analysis, in order to minimize source-specific bias as much as possible. For sanity checking purposes, we also added two basic baseline rankings (frequency and random).

We provide more details about all rankings in what follows.

3.5.1 WordNet

We used WordNet Miller (1998) version 3.0 and counted the number of synonym sets or “synsets” of each word.

3.5.2 WordNet-Reduced

There are very subtle differences among the WordNet senses (“synsets”), making distinguishing between them difficult, and even irrelevant in some applications Palmer et al. (2004, 2007); Brown et al. (2010); Rumshisky (2011); Jurgens (2013). For instance, call has 41 senses in the original WordNet (28 as verb and 13 as noun). Even for other words with fewer senses, like eating (7 senses in total), the difference between senses can be very tiny. For instance, “take in solid food” and “eat a meal; take a meal” are really close in meaning. This very fine granularity of WordNet may somewhat artificially increase the polysemy of some words.

To reduce the granularity of the WordNet synsets, we used their sense keys (see ‘Sense Key Encoding’ at https://wordnet.princeton.edu/documentation/senseidx5wn). They follow the format lemma%ss_type:lex_filenum:lex_id:head_word:head_id, where ss_type represents the synset type (part-of-speech tag such as noun, verb, adjective) and lex_filenum represents the name of the lexicographer file containing the synset for the sense (noun.animal, noun.event, verb.emotion, etc.). We truncated the sense keys after lex_filenum.

For instance, “take in solid food” and “eat a meal; take a meal” initially correspond to two different senses with keys eat%2:34:00:: and eat%2:34:01::, but after truncation, they both are mapped to the same sense: eat%2:34. However, coarse differences in senses are still captured. For instance, bank “sloping land” (bank%1:17:01::) and bank “financial institution” (bank%1:14:00::) are still mapped to two different senses after truncation, respectively bank%1:17 and bank%1:14.
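The truncation can be illustrated with a few lines of string handling. This is a minimal sketch that operates on sense-key strings directly; the function names are hypothetical, and how the keys are retrieved from WordNet is left out.

```python
def truncate_sense_key(sense_key):
    """Keep only lemma%ss_type:lex_filenum from a WordNet sense key."""
    lemma, _, rest = sense_key.partition("%")
    ss_type, lex_filenum = rest.split(":")[:2]
    return f"{lemma}%{ss_type}:{lex_filenum}"


def count_coarse_senses(sense_keys):
    """Number of distinct senses of a word after truncation."""
    return len({truncate_sense_key(key) for key in sense_keys})


# The two examples discussed in the text.
print(count_coarse_senses(["eat%2:34:00::", "eat%2:34:01::"]))
print(count_coarse_senses(["bank%1:17:01::", "bank%1:14:00::"]))
```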

3.5.3 WordNet-Domains

WordNet Domains Bentivogli et al. (2004); Magnini and Cavaglia (2000) is a lexical resource created in a semi-automatic way to augment WordNet with domain labels. Instead of synsets, each word is associated with a number of semantic domains. The domains are areas of human knowledge (politics, economy, sports, etc.) exhibiting specific terminology and lexical coherence. As for the two previous WordNet ground truth rankings, we simply counted the number of domains associated with each word.

3.5.4 OntoNotes

OntoNotes Hovy et al. (2006); Weischedel et al. (2011) is a large annotated corpus comprising various genres of text (news, conversational telephone speech, weblogs, newsgroups, broadcast, talk shows) with structural information and shallow semantics.

We counted the senses in the sense inventory of each word. The senses in OntoNotes are groupings of the WordNet synsets, constructed by human annotators. As a result, the sense granularity of OntoNotes is coarser than that of WordNet Brown et al. (2010).

3.5.5 Oxford

We counted the number of senses returned by the Oxford dictionary (www.lexico.com), which was, at the time of this study, the resource underlying the Google dictionary functionality.

3.5.6 Wikipedia

We capitalized on the Wikipedia disambiguation pages (https://en.wikipedia.org/wiki/word_(disambiguation)). Such pages contain a list of the different categories under which one or more articles about the query word can be found. For example, the disambiguation page of the word bank includes categories such as geography, finance, computing (data bank) and science (blood bank). We counted the number of categories on the disambiguation page of each word to generate the ranking.

3.5.7 Frequency and random baselines

In the frequency baseline, we ranked words in decreasing order of their frequency in the entire Wikipedia dump (see subsection 3.1). The naive assumption made here is that words occurring the most have the most senses.

With the random baseline, on the other hand, we produced rankings by shuffling words. Further, we assigned them random scores by sampling from the log-normal distribution with mean and standard deviation 0 and 0.6 (resp.), to imitate the long-tail behavior of the other score distributions, as can be seen in Fig. 2. All distributions can be seen in Fig. 6. Note that to account for randomness, all results for the random baseline are averages over 30 runs.
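A random baseline of this kind could be generated as sketched below. We assume here that 0 and 0.6 are the mean and standard deviation of the underlying normal distribution (the usual parameterization of a log-normal); the function name and the seeding scheme are ours.

```python
import random


def random_baseline(words, seed, mu=0.0, sigma=0.6):
    """Shuffle the words and give each one a log-normally distributed score."""
    rng = random.Random(seed)
    shuffled = list(words)
    rng.shuffle(shuffled)
    # lognormvariate(mu, sigma) returns exp(z) with z ~ Normal(mu, sigma).
    return {word: rng.lognormvariate(mu, sigma) for word in shuffled}


# One baseline ranking per run; metrics would then be averaged over the 30 runs.
vocabulary = ["bank", "count", "live", "python"]  # hypothetical word list
runs = [random_baseline(vocabulary, seed=run) for run in range(30)]
```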

Figure 2: Average score distribution of the 5 ground truth rankings and frequency baseline (histogram) vs. average score distribution of the random baseline (blue curve).

Not all of the 1822 words included in our analysis had an entry in each of the resources described above. The length of each ground truth ranking is shown in Table 1.

Ranking # words
WN 1535
WN-reduced 1535
WN-Domains 1420
Oxford 1536
Wikipedia 1042
OntoNotes 723
Frequency & random 1822
Table 1: Length of the ground truth rankings.

4 Evaluation metrics

As will be detailed next, we used 6 standard metrics from the fields of statistics and information retrieval to compare among methods. To ensure fair comparison, the scores in the rankings of all methods were normalized to be in the range $[0, 1]$ before proceeding.

Also, each method played in turn the role of candidate and ground truth. This allowed us to not only compute the similarity between our rankings and the ground truth rankings, but also the similarity among the ground truth rankings themselves, which was interesting for exploration purposes.

For each pair of evaluated and ground truth methods, only the parts of the rankings corresponding to the words in common (intersection) were compared. Thus, the rankings in each (candidate, ground truth) pair had equal length.
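The two preprocessing steps described in this section (normalization and restriction to common words) might be sketched as follows. Min-max scaling is an assumption on our part, since the text only states that scores are brought to a common range; the function names are also hypothetical.

```python
def min_max_normalize(scores):
    """Rescale a word -> score mapping linearly (assumed min-max scaling)."""
    lo, hi = min(scores.values()), max(scores.values())
    span = hi - lo
    return {w: (s - lo) / span if span else 0.0 for w, s in scores.items()}


def align(candidate, truth):
    """Return two equal-length score lists restricted to the words in common."""
    common = sorted(set(candidate) & set(truth))
    return [candidate[w] for w in common], [truth[w] for w in common]


# Hypothetical scores: only "bank" and "count" appear in both rankings.
ours = min_max_normalize({"bank": 3.0, "count": 5.0, "live": 1.0})
wordnet = min_max_normalize({"bank": 18, "count": 12, "python": 3})
print(align(ours, wordnet))
```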

4.1 Similarity and correlation metrics

4.1.1 Cosine similarity

Cosine similarity measures the angle between the two vectors whose coordinates are given by the scores in the evaluated and ground truth rankings. What is evaluated here is the alignment between rankings, i.e., the extent to which the candidate method assigns high/low scores to the same words that receive high/low scores in the ground truth. Since all rankings have positive scores, cosine similarity is in $[0, 1]$, where 0 indicates that the two vectors are orthogonal and 1 means that they are perfectly aligned. Since we are computing the value of an angle, only the ratios/proportions of scores matter here. E.g., two rankings whose scores differ only by a constant positive factor would be considered perfectly aligned.
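A minimal sketch of this metric is given below; the function name and the toy score vectors are ours. The example shows that rescaling all scores by a constant leaves the similarity unchanged.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length score vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


# Proportional score vectors are perfectly aligned; orthogonal ones score 0.
print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```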

4.1.2 Spearman’s rho

Spearman’s rho Spearman (1904) is a measure of rank correlation. More precisely, it equals the famous Pearson product-moment correlation coefficient computed from the ranks of the scores in the two rankings, rather than from the scores themselves.
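The definition above translates directly into code: rank both score lists, then apply the Pearson coefficient to the ranks. The sketch below is illustrative only (it is not the R `cor()` call used in the study); ties receive their average rank, which is a common convention that we assume here.

```python
import math


def average_ranks(scores):
    """Ranks starting at 1, with tied scores sharing their average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current run of tied scores.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson product-moment correlation coefficient."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman_rho(x, y):
    """Pearson correlation computed on ranks instead of raw scores."""
    return pearson(average_ranks(x), average_ranks(y))
```

Because only ranks matter, any monotonically increasing transformation of the scores leaves the coefficient unchanged.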

4.1.3 Kendall’s tau

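Kendall’s tau Kendall (1938) is another measure of rank correlation, based on signs of rank differences. One can compute it by counting concordant and discordant pairs among the ranks of the scores in the two rankings. More precisely, given two rankings $x$ and $y$, a pair $(x_i, y_i), (x_j, y_j)$ for $i < j$ is said to be concordant if $x_i > x_j$ and $y_i > y_j$, or if $x_i < x_j$ and $y_i < y_j$; it is discordant if the two orderings disagree. Based on this notion, the metric is expressed:

$$\tau = \frac{\#\text{concordant pairs} - \#\text{discordant pairs}}{n(n-1)/2}$$

Kendall’s tau can also be written:

$$\tau = \frac{2}{n(n-1)} \sum_{i<j} \mathrm{sign}(x_i - x_j)\,\mathrm{sign}(y_i - y_j)$$

where sign designates the sign function and $n$ is the length of the two rankings.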

Both Spearman’s rho and Kendall’s tau take values in $[-1, 1]$ ($-1$ for reversed rankings and $1$ for identical rankings), and approach zero when the correlation between the two rankings is low (independence).
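The sign-based formulation of Kendall’s tau can be sketched in a few lines. This quadratic-time version applies no correction for ties, which library implementations may handle differently; the function names are ours.

```python
def sign(value):
    """Sign function: -1, 0, or 1."""
    return (value > 0) - (value < 0)


def kendall_tau(x, y):
    """Concordant minus discordant pairs, divided by the number of pairs."""
    n = len(x)
    total = 0
    for i in range(n):
        for j in range(i + 1, n):
            # +1 for a concordant pair, -1 for a discordant pair, 0 for a tie.
            total += sign(x[i] - x[j]) * sign(y[i] - y[j])
    return 2 * total / (n * (n - 1))


print(kendall_tau([1, 2, 3], [1, 3, 2]))  # two concordant pairs, one discordant
```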

4.2 Information retrieval metrics

4.2.1 p@k

Here, we simply compute the percentage of words in the top 10% of the candidate ranking that are present in the top 10% of the ground truth ranking. The idea here is to measure ranking quality for the most polysemous words.
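A possible implementation of this overlap measure is sketched below, assuming both rankings are given as word-to-score mappings over the same words. The function name and the rounding of the cutoff (integer part of 10% of the length, at least 1) are our own choices.

```python
def precision_at_top(candidate, truth, fraction=0.1):
    """Share of the candidate's top words that are also in the truth's top words."""
    k = max(1, int(len(candidate) * fraction))

    def top(scores):
        return set(sorted(scores, key=scores.get, reverse=True)[:k])

    return len(top(candidate) & top(truth)) / k


# Hypothetical 10-word rankings: with fraction=0.2, the top 2 words are compared.
cand = {f"w{i}": i for i in range(10)}
gold = dict(cand, w0=100)  # w0 jumps to the top of the ground truth
print(precision_at_top(cand, gold, fraction=0.2))
```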

4.2.2 NDCG

The Normalized Discounted Cumulative Gain or NDCG Järvelin and Kekäläinen (2002) is a standard metric in information retrieval. It is based on the Discounted Cumulative Gain (DCG):

$$\mathrm{DCG} = \sum_{i=1}^{n} \frac{rel_i}{\log_2(i+1)}$$

where $rel_i$ designates the ground truth score of the word at the $i$-th position in the ranking under consideration, and $n$ denotes the length of the ranking. NDCG is then expressed as:

$$\mathrm{NDCG} = \frac{\mathrm{DCG}}{\mathrm{IDCG}}$$

The denominator is called the ideal DCG, or IDCG. It is the DCG computed with the order provided by the ground truth ranking, that is, for the best possible word positioning.

Since the scores are penalized proportionally to their position in the ranking (with some concavity), the more words with high ground truth scores are placed on top of a candidate ranking, the better the NDCG of that ranking. NDCG is maximal and equal to 1 if the candidate and ground truth rankings are identical.
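The computation can be sketched as follows, assuming the common form of DCG with a linear gain and a $\log_2(i+1)$ discount (other variants of the gain exist, and the exact one used in this study may differ). Both rankings are assumed to be word-to-score mappings over the same words; the function names are ours.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain of relevance scores listed in ranked order."""
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(relevances, start=1))


def ndcg(candidate, truth):
    """DCG of the truth scores in candidate order, divided by the ideal DCG."""
    order = sorted(candidate, key=candidate.get, reverse=True)
    gains = [truth[word] for word in order]
    ideal = sorted(truth.values(), reverse=True)
    return dcg(gains) / dcg(ideal)


# Hypothetical scores: the candidate ranks the three words in reverse order.
gold_scores = {"bank": 1.0, "count": 0.5, "live": 0.0}
reversed_scores = {"bank": 0.0, "count": 0.5, "live": 1.0}
print(ndcg(reversed_scores, gold_scores))
```

Note that swapping the roles of candidate and ground truth generally changes the value, which is why NDCG is not symmetric.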

4.2.3 RBO

The Rank Biased Overlap or RBO Webber et al. (2010) takes values in $[0, 1]$, where 0 means that the two rankings are independent and 1 that they match exactly. It is computed as:

$$\mathrm{RBO} = (1-p) \sum_{d=1}^{n} p^{d-1} A_d$$

where $A_d$ is the proportion of words belonging to both rankings up to position $d$, $n$ is the length of the rankings, and $p$ is a parameter controlling how steep the decline in weights is: the smaller $p$, the more top-weighted the metric is. When $p=0$, only the top-ranked word is considered, and the RBO is either zero or one. When $p$ is close to 1, the weights become flat, and more and more words are considered. We used $p=0.98$ in our experiments, which means that the top 50 positions received 86% of all weight.
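For illustration, a truncated version of RBO can be written as below, where each ranking is a list of distinct words sorted from most to least polysemous. This is a simplified sketch, not the public implementation relied upon in this study: the sum stops at the length of the rankings, so two identical rankings of length $n$ obtain $1 - p^n$ rather than exactly 1. The default value of `p` is an assumption.

```python
def rbo(ranking_a, ranking_b, p=0.98):
    """Truncated rank-biased overlap of two equal-length lists of distinct words."""
    depth = min(len(ranking_a), len(ranking_b))
    seen_a, seen_b = set(), set()
    overlap = 0  # number of words present in both prefixes
    total = 0.0
    for d in range(1, depth + 1):
        word_a, word_b = ranking_a[d - 1], ranking_b[d - 1]
        if word_a == word_b:
            overlap += 1
        else:
            # Each new word counts if the other ranking has already shown it.
            overlap += (word_a in seen_b) + (word_b in seen_a)
        seen_a.add(word_a)
        seen_b.add(word_b)
        total += p ** (d - 1) * overlap / d  # agreement at depth d, weighted
    return (1 - p) * total


print(rbo(["bank", "count", "live"], ["bank", "live", "count"], p=0.9))
```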

4.3 Implementations

We used the cor() function (https://stat.ethz.ch/R-manual/R-patched/library/stats/html/cor.html) from base R (R Core Team, 2018) to compute the $\rho$ and $\tau$ statistics. For RBO, we relied on a publicly available Python implementation (https://github.com/changyaochen/rbo). For all other metrics, we wrote our own implementations.

Figure 3: Pairwise similarity matrices between methods. For readability, all scores are shown as percentages. For Kendall and Spearman, * and *** denote the weakest and strongest reported levels of statistical significance, respectively. For a given metric, our configuration that best matches (on average) all other methods (except random and frequency) is always shown first. DL means that the compressed contextual embedding space has D dimensions and that the hierarchy has L levels. Rand, freq, wiki, oxf, ON, WN, WNred, and WNdom are short for random, frequency, Wikipedia, Oxford, OntoNotes, WordNet, WordNet reduced, and WordNet domains. All metrics except NDCG are symmetric, hence we only show one triangle for them. For NDCG, candidate methods are shown as columns and ground truths as rows.
sentences bin coordinates
it stars christopher lee as count dracula along with dennis waterman
the count of the new group is the sum of the separate counts of the two original groups
the first fight did not count towards the official record
five year old horatia came to live at merton in may 1805
it features various amounts of live and backstage footage while touring
first tax bills were used to pay taxes and to register bank deposits and bank credits
the ball nest is built on a bank tree stump or cavity
Table 2: Sentences containing different senses of the same word can be sampled by selecting from different bins.

5 Results

Our rankings correlate well with human rankings. Results are shown in Fig. 3, as pairwise similarity matrices, for all six metrics. For readability, all scores are shown as percentages. For a given metric, our configuration that best matches, on average, all other methods (except random and frequency) is always shown as the first column. Since all metrics except NDCG are symmetric, we only show the lower triangles of the other matrices. For NDCG, candidate methods are shown as columns and ground truths as rows.

For each of the six evaluation metrics, it can be seen that the ranking generated by our unsupervised, data-driven method is well correlated with all human-derived ground truth rankings. This means that our method is robust to how one defines and measures correlation or similarity.

In some cases, we even very closely reproduce the human rankings. For instance, our best configurations for cosine and NDCG get almost perfect scores of 86.5 and 99.72 when compared against Wikipedia. In terms of Kendall’s tau, Spearman’s rho, p@k, and RBO, we are also very close to OntoNotes (scores of 49.43, 35.23, 39.53, and 33.47, resp.).

Finally, the correlation between our rankings and the human rankings can also be observed to be, everywhere, much stronger than that between the baseline rankings (random and frequency) and the human rankings.

Statistical significance. We computed statistical significance for the Spearman’s rho and Kendall’s tau metrics. As can be seen in Fig. 3, the null hypothesis that there is no correlation between our rankings and the human-derived ground truth rankings was systematically rejected everywhere, with very high significance (*** in Fig. 3).

However, against the random baseline, the same null hypothesis (no correlation) could not be rejected anywhere. Against frequency, the null was rejected, but very weakly (only at the * level of Fig. 3), and with very low correlation coefficients (6.53 for Spearman and 4.44 for Kendall).

Finally, the correlation between the random and frequency rankings and the ground truth rankings is never statistically significant, with the exception of the pair frequency/OntoNotes, but again, only at a weak level.

Hyperparameters have a significant impact on performance, but optimal values are consistent across metrics. First, as can be observed from Fig. 4 and Fig. 5, there is a large variability in performance when $D$ (number of PCA dimensions) and $L$ (number of levels in the hierarchy) vary.

However, for all six evaluation metrics, the best configurations are very similar (they are the ones shown first in each matrix of Fig. 3; for RBO, two configurations had the same score). Given the rather large grid we explored ($[2, 20]$ and $[2, 19]$ for $D$ and $L$, resp.), with 342 combinations in total, we can say that all these optimal values belong to the same small neighborhood. This interpretation is confirmed by inspecting Fig. 4, where it can clearly be seen that the optimal area of the hyperparameter space is robust to metric selection, and consistently corresponds to small values of $D$ (around 3), and values of $L$ at least above 3 or 4, ideally around 8. For larger values of $L$, performance plateaus (keeping $D$ fixed). In other words, it is necessary to have some levels in the hierarchy, but having very deep hierarchies is not required for our method to work well. A benefit of having such small optimal values of $D$ and $L$ is their affordability, from a computational standpoint.

All rankings derived from WordNet-based resources are highly correlated. It is interesting to note that the rankings generated from OntoNotes, WordNet, WordNet reduced, and WordNet domains are all highly similar, despite their very different sense granularities. This means that despite the apparent differences in these resources, they all tend to assign the same number of senses to the same words. The Oxford rankings tend to be part of this high-similarity cluster as well, to a lesser extent.

Frequent words are not the most polysemous. Finally, one last interesting observation we can make is that while the frequency ranking is much better than the random ones, it still is far away from the human rankings. In other words, the frequency of appearance of a word (excluding stopwords, of course), is not as good an indicator of its polysemy as one could expect.

Figure 4: Performance (color scale) vs. number of PCA dimensions and number of levels in the hierarchy (one per axis).
Figure 5: Performance distributions over the 342 values in the discrete hyperparameter space (grids of Fig. 4).
Figure 6: Normalized ranking score distributions for the random and frequency rankings and the human-derived ground truth rankings.

6 Sampling diverse examples

An interesting application of our discretization strategy is that it can be used to select sentences containing different senses of the same word, as illustrated in Table 2. Provided a mapping, for a given word, between the sentences that were passed to the pre-trained language model and the vectors, we can sample vectors from different bins and retrieve the associated sentences. If the bins are distant enough, the sentences will contain different senses of the word. For instance, in Table 2, we can see that we are able to sample sentences containing three senses of the word count: (1) noble title, (2) determining the total number, and (3) taking into account. While a by-product of our approach, this sampling methodology has many useful applications in practice, e.g., in online dictionaries, dataset creation, etc.
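The sampling procedure could be sketched as follows. This is only an illustration under our own assumptions: the function names, the explicit bounding box, and the toy 2-dimensional vectors are hypothetical, and bins are picked at random here, without enforcing that they be distant from one another.

```python
import random
from collections import defaultdict


def bin_index(vector, level, lo, hi):
    """Index of the bin containing `vector` when 2**level bins are drawn per dimension."""
    bins_per_dim = 2 ** level
    index = []
    for x, a, b in zip(vector, lo, hi):
        k = int((x - a) / (b - a) * bins_per_dim)
        index.append(max(0, min(k, bins_per_dim - 1)))
    return tuple(index)


def sample_from_bins(sentences, vectors, level, lo, hi, n_bins, seed=0):
    """Pick up to n_bins occupied bins and return one sentence from each."""
    by_bin = defaultdict(list)
    for sentence, vector in zip(sentences, vectors):
        by_bin[bin_index(vector, level, lo, hi)].append(sentence)
    rng = random.Random(seed)
    chosen = rng.sample(sorted(by_bin), min(n_bins, len(by_bin)))
    return [(b, rng.choice(by_bin[b])) for b in chosen]


# Toy mapping between sentences and (already reduced) contextual vectors.
toy_sentences = [
    "the first fight did not count towards the official record",
    "the count of the new group is the sum of the separate counts",
    "it stars christopher lee as count dracula",
]
toy_vectors = [(0.10, 0.10), (0.15, 0.12), (0.90, 0.90)]
for coords, sentence in sample_from_bins(
    toy_sentences, toy_vectors, level=1, lo=(0, 0), hi=(1, 1), n_bins=2
):
    print(coords, sentence)
```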

7 Related work

Task. To the best of our knowledge, this study is the first to focus purely on polysemy quantification, that is, on estimating the number of senses of words, without trying to label these senses. Also, this study is, still to the best of our knowledge, the first to approach word sense disambiguation (or a subtask thereof, to be precise), from a purely empirical and unsupervised standpoint. Indeed, except for performance evaluation, no human annotators (even non-expert ones), and no human-constructed word sense inventories or dictionaries, are involved in our process.

For the reasons above, we did not find any previous work directly comparable with ours in the literature. However, several previous efforts have focused on creating sense inventories without human experts.

For instance, in Rumshisky (2011); Rumshisky et al. (2012), Amazon Mechanical Turk (AMT) workers are given a set of sentences containing the target word, and one sentence that is randomly selected from this set as a target sentence. Workers are then asked to judge, for each sentence, whether the target word is used in the same way as in the target sentence. This creates an undirected graph of sentences. Clustering can then be applied to that graph to find senses. To label clusters with senses, one has to manually inspect the sentences in each cluster. (We asked the authors to share annotations with us to use as ground truth, but they were unable to do so.)

More recently, Jurgens (2013) compared three annotation methodologies for gathering word sense labels on AMT. The methods compared are Likert scales, two-stage select and rate, and difference between counts of when senses were rated best/worst. Regardless of the strategy, inter-annotator agreement remains low (around 0.3). (Here too, we asked for the annotations to use as ground truth, but they could not be shared.)

Methodology. In the original ELMo paper, Peters et al. (2018) have shown that using contextual word representations (through nearest neighbor matching) improves word sense disambiguation. Hadiwinoto et al. (2019) showed that this technique, along with some other ones, works well for BERT too.

From a methodological point of view, our approach is related in spirit to pyramid matching Nikolentzos et al. (2017); Grauman and Darrell (2007); Lazebnik et al. (2006). This kernel-based method originated in computer vision, and computes the similarity between objects by placing a sequence of increasingly coarse grids over the feature space and taking a weighted sum of the number of matches that occur at each resolution level. Matches found at finer resolutions are weighted more highly than matches found at coarser resolutions.

8 Conclusion

We proposed a novel unsupervised, fully data-driven geometrical approach to estimate word polysemy. Our approach builds multiresolution grids in the contextual embedding space. We showed through rigorous experiments that our rankings are well correlated (with strong statistical significance) to 6 different human rankings, for 6 different metrics. Such fully data-driven rankings of words according to polysemy can help in creating new sense inventories, but also in validating and interpreting existing ones. Increasing the quality and consistency of sense inventories is a key first step of the word sense disambiguation pipeline. We also showed that our discretization can be used, at no extra cost, to sample contexts containing different senses of a given word, which has useful applications in practice. Finally, the fully unsupervised nature of our method makes it applicable to any language.

While our scores are a good proxy for polysemy, they are not equal to word sense counts. Moreover, we do not label each sense. Future work should address these challenges, by, e.g., automatically selecting bins of interest, and generating labels for them. Another direction of work is investigating how different contextual embeddings (e.g., BERT) impact our rankings. Finally, it would be interesting to test the effect on performance of basic transformations of the contextual embedding space, such as that proposed in Mu et al. (2017).

9 Acknowledgments

We thank Giannis Nikolentzos for helpful discussions about pyramid matching. The GPU that was used in this study was donated by the NVidia corporation as part of their GPU grant program. This work was supported by the LinTo project.