A Simple Approach to Learn Polysemous Word Embeddings

07/06/2017 ∙ by Yifan Sun, et al. ∙ 0

Many NLP applications require disambiguating polysemous words. Existing methods that learn polysemous word vector representations first detect the various senses and then optimize the sense-specific embeddings separately, which is invariably more involved than single-sense learning methods such as word2vec. Evaluating these methods is also problematic, as rigorous quantitative evaluations in this space are limited, especially when compared with single-sense embeddings. In this paper, we propose a simple method to learn a word representation, given any context. Our method only requires learning the usual single-sense representation, and coefficients that can be learnt via a single pass over the data. We propose several new test sets for evaluating word sense induction, relevance detection, and contextual word similarity, significantly supplementing the currently available tests. Results on these and other tests show that while our method is embarrassingly simple, it achieves excellent results when compared to the state-of-the-art models for unsupervised polysemous word representation learning.




1 Introduction

Recent advances in word representation learning such as word2vec (Mikolov et al., 2013b) have significantly boosted the performance of numerous Natural Language Processing tasks (Mikolov et al., 2013b; Pennington et al., 2014; Levy et al., 2015). Despite their empirical performance, the inherent one-vector-per-word setting limits their application to tasks that require contextual understanding, such as part-of-speech tagging and semantic relatedness, because of the existence of polysemous words (Li and Jurafsky, 2015).

To this end, various sense-specific word embeddings have been proposed to account for the contextual subtlety of language (Reisinger and Mooney, 2010b, a; Huang et al., 2012; Neelakantan et al., 2015; Tian et al., 2014; Chen et al., 2014; Li and Jurafsky, 2015; Arora et al., 2016). A majority of these methods learn multiple vectors for each word via clustering. (Reisinger and Mooney, 2010b; Huang et al., 2012; Neelakantan et al., 2015) use neural networks to learn cluster embeddings in order to match a polysemous word with its correct sense embedding. Side information such as topical understanding (Liu et al., 2015b, a) or parallel foreign language data (Guo et al., 2014; Šuster et al., 2016; Shyam et al., 2017) has also been exploited for clustering different meanings of multi-sense words. Another trend is to forgo word embeddings in favor of sentence or paragraph embeddings for specific tasks (amiri2016learning; Kiros et al., 2015; Le and Mikolov, 2014). While more flexible and adaptive to context, all these approaches require sophisticated neural network structures and are problem specific, taking away the advantage offered by the unsupervised approaches used for single-sense embeddings. This paper bridges this gap.

In this paper we propose a novel and extremely simple approach to learn sense-specific word embeddings. The essence of our approach is to assign each word a global base vector and model the contextual embedding as a linear combination of its context base vectors. Instead of a joint optimization to learn both base vector and combination weights, we propose to use the standard unisense word representation for the base vectors, and the (suitably normalized) word co-occurrence statistics as the linear combination weights; no further training computations are required in our approach.

We evaluate our approach on various tasks that require contextual understanding of words, combining existing and new test datasets and evaluation metrics: word-sense induction (Koeling et al., 2005; Bartunov et al., 2015), contextual word similarity ((Huang et al., 2012) and a new test set), and relevance detection ((Arora et al., 2016) and a new test set). To the best of our knowledge, no prior literature has provided a comprehensive evaluation of all these multisense-specific tasks. Our simple, intuitive model retains almost all advantages offered by more complicated multisense embedding models, and often surpasses the performance of nonlinear “deep” models. Our code and data are at https://github.com/dingwc/multisense/

To summarize, the contributions of our paper are as follows:

  1. We propose an extremely simple model for learning polysemous word representations

  2. We propose several new larger test sets to evaluate polysemous word embeddings, supplementing those that already exist

  3. We perform extensive comparisons of our model to other widely used multisense models in the literature, and show that the simplicity of our model does not come at the cost of performance

The rest of the paper is organized as follows: in the next section, we introduce our model and provide a detailed explanation for obtaining the multisense word embeddings. In Section 3 we qualitatively evaluate our model. In Section 4 we introduce our new evaluation tasks and datasets in detail. We also perform extensive experiments on four quantitative tests that are multisense specific. We finally conclude our paper in Section 5.

2 Our Contextual Embedding Model

Like unisense vectors, sense-specific vectors should be closely aligned to words in that sense. This idea of local similarity has been widely used to obtain contextual sense representations (Chen et al., 2014; Huang et al., 2012; Le and Mikolov, 2014; Neelakantan et al., 2015). It was also used to decompose a unisense vector into sense-specific vectors (Arora et al., 2016). In this paper, we exploit this intuition and model the contextual embedding of a word as a linear combination of its contexts.

Specifically, we consider a corpus drawn from a vocabulary. We define the normalized co-occurrence matrix as the sparse symmetric matrix given by Eq. (1); the sparsity of this matrix was studied in (Pennington et al., 2014).


We define a context as a collection of words provided alongside the target word. The context is flexible: it can be the sentence or paragraph in which the word appears, or a set of synonyms from WordNet. A standard unisense embedding (such as word2vec (Mikolov et al., 2013b)) can be represented as a matrix with one column per vocabulary word, where the number of rows is the embedding dimension and each column is the embedding vector of the corresponding word. The multisense embedding of a word given a context (Eq. (2)) is then the sum of the base embedding vectors of the context words, each weighted by its normalized co-occurrence with the target word.

Take, for example, the word bank with context I must stop by the bank for a quick withdrawal. Some context words (withdrawal) are more relevant than others (must, stop, quick); because the weight for each context word is its normalized co-occurrence with bank, the weighted sum filters for relevant context words.
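The weighted sum described above can be illustrated with a minimal Python sketch. It is not the released code: the function name `contextual_embedding`, the dictionary-based storage, and the toy vectors and weights are all hypothetical, and the `cooc` values merely stand in for the normalized co-occurrence of Eq. (1).

```python
from typing import Dict, List, Sequence


def contextual_embedding(
    word: str,
    context: Sequence[str],
    base: Dict[str, List[float]],
    cooc: Dict[str, Dict[str, float]],
) -> List[float]:
    """Sum the base vectors of the context words, each weighted by
    cooc[word][c], a stand-in for the normalized co-occurrence."""
    dim = len(next(iter(base.values())))
    out = [0.0] * dim
    weights = cooc.get(word, {})
    for c in context:
        w = weights.get(c, 0.0)  # pairs never seen together get weight 0
        if w == 0.0 or c not in base:
            continue
        for k, x in enumerate(base[c]):
            out[k] += w * x
    return out


# Toy base vectors and weights, made up purely for illustration.
base = {
    "withdrawal": [2.0, 0.0],
    "river": [0.0, 2.0],
    "quick": [1.0, -1.0],
}
cooc = {"bank": {"withdrawal": 0.5, "river": 0.4, "quick": 0.0}}

finance = contextual_embedding("bank", ["quick", "withdrawal"], base, cooc)
geology = contextual_embedding("bank", ["river"], base, cooc)
```

In this toy setting the irrelevant word quick has weight zero and drops out, so the two embeddings of bank point along different axes.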

We can view sets of columns of the embedding matrix as spanning sense subspaces. For example, the (likely low-dimensional) subspace spanned by the columns corresponding to vectors for financial terms should also be highly correlated with savings and much less correlated with river; in other words, the embeddings of the multisense word bank given in either context will be well separated.


In the remainder of the paper, we use the normalized co-occurrence matrix defined via (1), constructed from the 2016 English Wikipedia dump (https://dumps.wikimedia.org/enwiki/) with a local window of size . The final vocabulary is obtained after filtering away non-English words, stop words, and rare words occurring fewer than 2,000 times. This results in a vocabulary of size . For base vectors we use either the pre-trained GloVe embedding with (http://nlp.stanford.edu/projects/glove/), or the word2vec (w2v) embedding trained over the Wikipedia corpus with and . The w2v embeddings are trained using the skip-gram model with negative sampling. We set the number of negative samples to be and the number of training epochs to be . The co-occurrence and embedding matrices are attached alongside our submission.
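As a rough illustration of the first step of this construction, the sketch below counts symmetric co-occurrences within a local window. It covers raw counts only: the normalization of Eq. (1) is not reproduced, and the function name `cooccurrence_counts` and the window value used in the example are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def cooccurrence_counts(tokens: List[str], window: int) -> Dict[Tuple[str, str], float]:
    """Count how often two tokens appear within `window` positions of
    each other; every pair is recorded in both directions."""
    counts: Dict[Tuple[str, str], float] = defaultdict(float)
    for i, w in enumerate(tokens):
        last = min(i + window, len(tokens) - 1)
        for j in range(i + 1, last + 1):
            counts[(w, tokens[j])] += 1.0
            counts[(tokens[j], w)] += 1.0
    return counts


tokens = ["river", "bank", "flood", "river"]
counts = cooccurrence_counts(tokens, window=1)
```

Filtering of stop words and rare words, as described above, would be applied to `tokens` before counting.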

3 Qualitative Examples

3.1 Norm Distribution of Our Approach

Figure 1: The norm distribution of contextual embeddings of all the word-context pairs in the word-context-relevance R1 dataset based on (uni=w2v,dim=100) embedding. The upper plot depicts the full distribution while the bottom plot depicts the zoomed-in distribution with -norm close to . Word-context pairs are labeled either relevant or irrelevant in the dataset.

Before we present qualitative examples, we first observe that the contextual embedding vectors of words in irrelevant contexts have very low norms. In Fig. 1 we select around 500 word-context pairs from a word-context-relevance evaluation dataset (see Section 4 for details) and plot the histogram of the contextual embedding vector norms. In this evaluation data, word-context pairs are labeled either relevant or irrelevant. As depicted in Fig. 1, the norm filtering effect is essential, given that, unlike previous embeddings, our embeddings allow any word to act as context. In contrast, the multi-sense embeddings from prior literature (Figure 2; details of how we obtain embeddings from these competing methods are in Section 4) all have very similar norm distributions, regardless of context relevance.

Figure 2: The norm distribution of all the word-context pairs in the word-context-relevance R1 dataset for (Neelakantan et al., 2015) 300d_10s_1.32c_0mv embedding (top), (Huang et al., 2012) cossim embedding (middle), (Chen et al., 2014) (bottom). The norm-distribution plot indicates that contextual embeddings of relevant and irrelevant word-context pairs have similar norms, which is typical of existing polysemous word embeddings. Contrast this to Figure 1 (our method), where the norms are distributed according to relevance.

This relevance filtering effect is advantageous in sentences where many neighboring words may not describe the query word. However, the extreme decaying norm distribution of our method (and of (Chen et al., 2014)) in the above figures can make it difficult to measure contextual word similarity using cosine distance alone, as it magnifies words with very small norm that had already been identified as irrelevant. At the other extreme, using the dot product overemphasizes common words. To mitigate this, we present a generalized similarity measure with a tunable parameter that controls how much the norm is taken into account.

3.2 Similarity Measure of Our Embedding

We propose to measure the contextual similarity using the geometric mean of the cosine similarity and dot product, as a tradeoff of these two extremes:


for . Specifically, and is the cosine distance between and .
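To make the tradeoff concrete, here is a minimal Python sketch of one measure that interpolates between the two extremes. The parameterization is an assumption: we take a parameter `alpha` in [0, 1], with `alpha = 0` giving the dot product and `alpha = 1` giving cosine similarity, and divide the dot product by the product of norms raised to `alpha`. For positive dot products this equals the weighted geometric mean of the dot product and the cosine similarity. The function name `alpha_similarity` is hypothetical.

```python
import math
from typing import Sequence


def alpha_similarity(u: Sequence[float], v: Sequence[float], alpha: float) -> float:
    """Dot product divided by (|u| * |v|) ** alpha.

    alpha = 0 returns the plain dot product, alpha = 1 the cosine
    similarity; values in between partially discount the norms.
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_prod = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm_prod == 0.0:
        return 0.0  # a zero vector is treated as similar to nothing
    return dot / norm_prod ** alpha


u, v = [3.0, 4.0], [4.0, 3.0]
as_dot = alpha_similarity(u, v, 0.0)
as_mix = alpha_similarity(u, v, 0.5)
as_cos = alpha_similarity(u, v, 1.0)
```

Dividing by a power of the norms, rather than raising the cosine itself to a power, keeps the measure well defined when the dot product is negative.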

Table 1 looks at the closest words to bank in its two contexts for different choice of -distance measure. For dot-product we see an overemphasis of popular words that are only marginally related (gently, steeply). For cosine similarity, rare words are overpromoted (saxony, sacramento). In general, and can reasonably measure contextual similarity.

| Context | Measure | Closest words |
|---|---|---|
| institution, currency, deposit, money, finance | dot product | currencies, currency, deposit, laundering, franc |
| | intermediate | currencies, currency, deposit, laundering, franc |
| | intermediate | deposit, currencies, currency, repayment, hedge |
| | intermediate | repayment, cheque, deposit, liquidity, borrowers |
| | cosine similarity | bank, credit, triangle, saxony, linking |
| water, land, sloping, river, flooding | dot product | steeply, gently, landslides, sloping, torrential |
| | intermediate | steeply, gently, landslides, tributaries, tributary |
| | intermediate | tributaries, tributary, yangtze, confluence, empties |
| | intermediate | tributaries, tributary, yangtze, confluence, empties |
| | cosine similarity | bank, sacramento, mouth, trail, fork |

Table 1: Closest words to bank in the context of finance and geology, for various choices of in (3). Recall that is the dot product and is cosine similarity.

To show the effect of when only relevant pairs are used, Figure 3 plots the SCWS score (see Section 4; also (Huang et al., 2012)) for varying , for the top-performing embedding from each method. Cos-distance works best for all embeddings. However, for embeddings from (Huang et al., 2012) and (Neelakantan et al., 2015), which all have norm 1, the choice of makes little difference. On the other hand, using the embeddings from (Chen et al., 2014) and our method, which both have highly varying norms, the choice of greatly affects performance.

Figure 3: Spearman correlation score for varying in (3) on SCWS with different embeddings. Only the embedding that gave each approach the best SCWS scores are reported.

3.3 Qualitative Example

Having explained the norm-filtering property of our approach and the -distance measure in Eq. (3), we are now able to show a few qualitative examples of our model. First, Table 2 shows the closest words to bank in three different contexts. Here we used the GloVe embeddings as the base vectors and set . Contexts are selected from news articles about finance, weather, and sports (an irrelevant context). The third case illustrates the filtering effect, with a norm that is an order of magnitude smaller than the first two.

Context: Banco Santander of Spain said on Wednesday that its profit declined by nearly half in the second quarter on restructuring charges and a contribution to a fund to help finance bank bail ins in Europe.
Neighbors: banco, santander, hsbc, brasil, barclays

Context: The Seine has continued to swell since the river burst its banks on Wed., raising alarms throughout the city. As of 10pm on Friday its waters had reached 20 feet. The river was expected to crest on Sat. morning at up to 21 feet and to remain at high levels throughout the weekend.
Neighbors: banks, ganges, bank, tigris, river

Context: Familia rebounded to strike out Tony Wolters, but first baseman James Loney then fumbled a slow ground ball by pinch hitter Cristhian Adames that allowed the tying run to score and left the bases loaded.
Neighbors: footed, balls, batsman, batsmen, winger

Table 2: Nearest words to and the -norm of the contextual embedding of bank in 3 different contexts.
| Phrase | Embedding | Nearest words |
|---|---|---|
| jack the ripper | U | billy, nicholson, tom, murphy, kelly |
| jack the ripper | M | whitechapel, murders, judas, owens, whedon |
| donald duck | U | jack, lee, george, lamb, howard |
| donald duck | M | daffy, lame, waterfowl, goofy, scrooge, teal |
| steve jobs | U | job, employees, hired, workers, manager |
| steve jobs | M | pixar, apple, odd, macintosh, unemployed, commute |

Table 3: Nearest words to phrase embedding (excluding itself). Words relevant specifically to the phrase are bolded. U=unisense. M=multisense.

Next, we investigate the use of our embedding model on phrase embeddings, constructed for example by averaging the embeddings of all words in the phrase, with the phrase itself as the context. In Table 3, we pick three well-known bi-grams (the is a stopword and ignored). The bi-gram embedding is the average of either the unisense GloVe embeddings (U) or the multisense embedding (M) using our model. The closest words to these embeddings are listed in Table 3. We observe that the contextual phrase embeddings are able to pull out meanings having to do with the phrase as a whole, and not just the sum of its parts.
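The phrase construction just described can be sketched as follows. This is a minimal illustration under assumptions: the helper names, the stop-word set, and the toy vectors and weights are hypothetical, and the `cooc` values stand in for the normalized co-occurrence weights.

```python
from typing import Dict, FrozenSet, List, Sequence


def contextual_embedding(word, context, base, cooc) -> List[float]:
    """Weighted sum of the base vectors of the context words."""
    dim = len(next(iter(base.values())))
    out = [0.0] * dim
    for c in context:
        w = cooc.get(word, {}).get(c, 0.0)
        if c in base:
            for k, x in enumerate(base[c]):
                out[k] += w * x
    return out


def phrase_embedding(
    phrase: Sequence[str],
    base: Dict[str, List[float]],
    cooc: Dict[str, Dict[str, float]],
    stopwords: FrozenSet[str] = frozenset({"the"}),
) -> List[float]:
    """Average the contextual embeddings of the phrase's words,
    using the phrase itself (minus stop words) as the context."""
    words = [w for w in phrase if w not in stopwords and w in base]
    dim = len(next(iter(base.values())))
    if not words:
        return [0.0] * dim
    vecs = [contextual_embedding(w, words, base, cooc) for w in words]
    return [sum(col) / len(vecs) for col in zip(*vecs)]


# Toy values, made up for illustration.
base = {"jack": [1.0, 0.0], "ripper": [0.0, 1.0]}
cooc = {
    "jack": {"jack": 1.0, "ripper": 0.5},
    "ripper": {"ripper": 1.0, "jack": 0.5},
}
phrase_vec = phrase_embedding(["jack", "the", "ripper"], base, cooc)
```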

Finally, in Table 4, we list the words with the highest norms when projected onto a single-word context. In all the cases we observe, high-norm words are highly relevant to the single context word. In the case of multisense words, a mixture of the different senses appears (e.g., chips has potato and pentium as relevant; keyboard has layouts and harpsichord).

| Context | Largest-norm words |
|---|---|
| eye | retina, ophthalmology, eye, sockets |
| keyboard | keyboard, layouts, harpsichord, sonata |
| run | yd, inning, td, rb |
| ball | fumbled, lucille, ball, wrecking |
| chips | chips, potato, pentium, chip |

Table 4: Top words in the context of a single word.

4 Empirical study

In this section we validate the performance of our embedding approach on a wide range of tasks that explicitly require contextual understanding of words. In Table 5 we collect (to our knowledge) the extent of multisense quantitative evaluations (columns 3-6 and 11-13), and supplement them with new, larger test sets (columns 7-9). All tests are provided in the attached dataset folder. In order to keep the evaluation fair across all embeddings, any word that is not in the intersection of the vocabularies of all embeddings is removed from the tests; for the preexisting test sets, this results in slightly smaller test sets than those first proposed.

Similarity measure

When irrelevant words are present, using a similarity measure that takes the norm into account is essential to leverage the norm distribution filtering effect. However, in all standard word similarity tasks, only relevant words are compared. Therefore, to keep our evaluations comparable with standard metrics, we measure cosine similarity for all tasks.

We compare our method against multisense approaches in (Huang et al., 2012; Neelakantan et al., 2015; Chen et al., 2014). In each case, we use their pre-trained model and choose the embedding of a target word that is closest to the context representation (as they suggest). Since the code in (Huang et al., 2012) allows choosing various distance functions, we pick all and report the best scores. For (Neelakantan et al., 2015; Chen et al., 2014) we use the cosine distance as recommended.

Overall Performance Table 5 shows that our method consistently outperforms (Huang et al., 2012; Neelakantan et al., 2015). We note that (Chen et al., 2014) is learned using additional supervision from the WordNet knowledge base in clustering; therefore, it achieves comparatively much higher scores in the WCR and CWS tasks, in which the evaluation is also based on WordNet. We now describe each task in detail.

| Embeddings | dim. | WCR R1 Sp. | WCR R1 P@1 | WCR R2 Sp. | WCR R2 P@1 | WCR R3 Sp. | WCR R3 P@1 | CWS AUC | CWS AP | SCWS Sp. | WSC C1 Acc | WSC C2 Acc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| (Huang et al., 2012): Euc. Dist. | 50 | 0.08 | 0.13 | 0.24 | 0.31 | 0.37 | 0.45 | 0.73 | 0.51 | 0.35 | 0.72 | 0.60 |
| (Huang et al., 2012): Max Diff. | 50 | 0.07 | 0.13 | 0.18 | 0.25 | 0.29 | 0.38 | 0.73 | 0.52 | 0.32 | 0.67 | 0.60 |
| (Huang et al., 2012): Min Diff. | 50 | 0.01 | 0.09 | 0.02 | 0.10 | 0.01 | 0.17 | 0.71 | 0.53 | 0.27 | 0.61 | 0.60 |
| (Huang et al., 2012): Intersect dist. | 50 | 0.02 | 0.36 | 0.10 | 0.46 | 0.07 | 0.46 | 0.69 | 0.47 | 0.35 | 0.62 | 0.60 |
| (Huang et al., 2012): Angle (cos-sim) | 50 | 0.19 | 0.29 | 0.24 | 0.33 | 0.34 | 0.44 | 0.73 | 0.51 | 0.39 | 0.72 | 0.60 |
| (Huang et al., 2012): City block dist. | 50 | 0.08 | 0.13 | 0.22 | 0.30 | 0.35 | 0.43 | 0.73 | 0.51 | 0.36 | 0.68 | 0.60 |
| (Huang et al., 2012): Hamming dist. | 50 | 0.15 | 0.27 | 0.19 | 0.31 | 0.27 | 0.43 | 0.72 | 0.51 | 0.37 | 0.68 | 0.60 |
| (Huang et al., 2012): Chi Sq. | 50 | 0.10 | 0.17 | 0.14 | 0.20 | 0.52 | 0.19 | 0.72 | 0.52 | 0.32 | 0.67 | 0.60 |
| (Neelakantan et al., 2015): 3s 30kmv | 50 | 0.20 | 0.27 | 0.25 | 0.34 | 0.39 | 0.49 | 0.72 | 0.47 | 0.53 | 0.70 | 0.62 |
| (Neelakantan et al., 2015): 3s 0mv | 300 | 0.22 | 0.30 | 0.27 | 0.38 | 0.41 | 0.54 | 0.66 | 0.44 | 0.59 | 0.70 | 0.62 |
| (Neelakantan et al., 2015): 3s 30kmv | 300 | 0.20 | 0.29 | 0.27 | 0.39 | 0.42 | 0.53 | 0.69 | 0.45 | 0.58 | 0.70 | 0.63 |
| (Neelakantan et al., 2015): 10s 1.32c 0mv | 50 | 0.21 | 0.29 | 0.25 | 0.35 | 0.43 | 0.55 | 0.71 | 0.47 | 0.53 | 0.71 | 0.63 |
| (Neelakantan et al., 2015): 10s 1.32c 30kmv | 50 | 0.20 | 0.30 | 0.24 | 0.30 | 0.42 | 0.52 | 0.72 | 0.48 | 0.51 | 0.69 | 0.63 |
| (Neelakantan et al., 2015): 10s 1.32c 0mv | 300 | 0.22 | 0.32 | 0.27 | 0.37 | 0.45 | 0.58 | 0.66 | 0.44 | 0.60 | 0.69 | 0.63 |
| (Chen et al., 2014) | 200 | 0.44 | 0.73 | 0.46 | 0.86 | 0.63 | 0.95 | 0.96 | 0.91 | 0.48 | 0.75 | 0.66 |
| Our method: uni=GloVe | 100 | 0.34 | 0.54 | 0.33 | 0.51 | 0.46 | 0.61 | 0.89 | 0.82 | 0.57 | 0.83 | 0.77 |
| Our method: uni=w2v | 50 | 0.33 | 0.51 | 0.33 | 0.52 | 0.46 | 0.61 | 0.89 | 0.82 | 0.61 | 0.81 | 0.76 |
| Our method: uni=w2v | 100 | 0.33 | 0.51 | 0.33 | 0.52 | 0.46 | 0.61 | 0.90 | 0.83 | 0.62 | 0.80 | 0.77 |

Table 5: Summary of all quantitative tests and performance metrics of our embedding approach against the compared baselines. For (Huang et al., 2012) different rows correspond to various types of distance used to get the contextual embedding. For (Neelakantan et al., 2015) the row labels follow the terminology in the paper. We highlight the top 3 results for each test and metric in the table. Sp = Spearman correlation. P@1 = Precision@1. AP = Average Precision. AUC = Area Under Curve. Acc = Classification Accuracy.

Word-Context Relevance (WCR) This task is proposed in (Arora et al., 2016) and aims to detect when word-context pairs are relevant. In (Huang et al., 2012; Neelakantan et al., 2015; Chen et al., 2014), the relevance metric can be seen as the distance (cosine or Euclidean) between the query word and the context cluster center. In our method, we rely on the filtering effect of the normalized co-occurrence values to diminish the norm of words in irrelevant contexts; thus we propose the norm of the contextual embedding as the metric of relevance, where the target word is excluded from the context if present. (The exclusion of the word is intended to highlight the filtering out of irrelevant contexts; in all other experiments, and as we intend in practice, the word is included in the context.) In all cases, the ability of this metric to capture relevancy is essential for the embedding to be applied successfully to real-world corpora, where not all neighboring words are relevant.
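The norm-based relevance metric for our method can be sketched in a few lines of Python. This is an illustration under assumptions, not the evaluation code: the function name `relevance_score`, the use of the Euclidean norm, and the toy vectors and weights are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def relevance_score(
    word: str,
    context: Sequence[str],
    base: Dict[str, List[float]],
    cooc: Dict[str, Dict[str, float]],
) -> float:
    """Norm of the contextual embedding of `word`, with the target
    word itself left out of the context as done for this task."""
    dim = len(next(iter(base.values())))
    vec = [0.0] * dim
    weights = cooc.get(word, {})
    for c in context:
        if c == word or c not in base:
            continue
        w = weights.get(c, 0.0)
        for k, x in enumerate(base[c]):
            vec[k] += w * x
    return math.sqrt(sum(x * x for x in vec))


# Toy values, made up for illustration.
base = {"tie": [1.0, 1.0], "collar": [3.0, 4.0], "hog": [2.0, 2.0]}
cooc = {"tie": {"tie": 1.0, "collar": 0.2, "hog": 0.0}}

relevant = relevance_score("tie", ["tie", "collar"], base, cooc)
irrelevant = relevance_score("tie", ["hog"], base, cooc)
```

A context whose words never co-occur with the target receives a score of zero, which is the filtering behavior the metric relies on.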

The task is as follows. We have available to us some databases of words and their related words, which we view as that word’s relevant context. We create the ground-truth by setting the labels of related word-context pairs to be , and for randomly picked word pairs to be . Specifically, R1 and R2 are constructed from the dataset in (Arora et al., 2016), and R3 is a newly provided much larger test set, separately constructed from WordNet.

In R1, the negative samples are created by keeping the word unchanged and sampling random contexts. In R2, the context is unchanged, and random words are provided. Note that for each word in each of the tests above, there is a single example with label . In total, there are 137 words and 534 senses, with on average 6.98 context words per query word. Some examples are provided in Table 6.

R3 is a new test set that significantly augments R1 and R2. Here, we manually collect a set of polysemous words (https://en.wikipedia.org/wiki/List_of_true_homonyms) and retrieve all their senses from WordNet (Leacock and Chodorow, 1998). We combine the definitions, synonyms, and example sentences in WordNet as the context for each sense. We have tests and for each test have negative samples, with random words in unchanged context. In total, there are 1938 words, 3234 senses, and on average 7.88 contexts per word, with some examples provided in Table 7.

For each valid pair, we measure the Spearman correlation (Sp.) between relevance metrics and ground-truth labels, as well as the Precision@1, i.e. the fraction of tests where the top item based on predicted scores is the valid pair. The reported performance metrics are averaged over all valid word-context pairs.
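Precision@1 as described here can be computed with a short Python sketch. The data layout (one list of scores and one list of 0/1 labels per test) and the function name `precision_at_1` are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def precision_at_1(tests: Sequence[Tuple[List[float], List[int]]]) -> float:
    """Fraction of tests in which the highest-scoring candidate
    is the valid (label 1) word-context pair."""
    hits = 0
    for scores, labels in tests:
        top = max(range(len(scores)), key=scores.__getitem__)
        hits += labels[top]
    return hits / len(tests)


# Two toy tests: in the first the valid pair is ranked on top,
# in the second a negative sample outranks it.
tests = [
    ([0.9, 0.1, 0.2], [1, 0, 0]),
    ([0.3, 0.5], [1, 0]),
]
p_at_1 = precision_at_1(tests)
```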

| Test | Relevant? | Word | Context |
|---|---|---|---|
| 1 | True | tie | neck men front worn collar knot cloth decorative |
| 1 | False | tie | throw propel ball direction basket goal game attain |
| 2 | True | tie | winner score tied completion identical results sports |
| 2 | False | tie | domestic hog pig culinary eaten cooked fat |

Table 6: Two example WCR-R1 tests for two senses of word tie. One valid (positive) word-context pair and a random (negative) pair are provided for each test.
| Test | Relevant? | Word | Context |
|---|---|---|---|
| 1 | True | bank | slope turn road track higher inside order reduce effects force |
| 1 | False | purposes | slope turn road track higher inside order reduce effects force |
| 2 | True | bank | financial institution accepts deposits channels money activities check holds home |
| 2 | False | lip | financial institution accepts deposits channels money activities check holds home |

Table 7: Two example WCR-R3 tests for two valid senses of word bank. One valid (positive) word-context pair and a random (negative) pair are provided for each test.

To further visualize the performance of each embedding on this task, we plot the distribution of the relevance metric for relevant and irrelevant pairs in Figure 4 for the compared methods. It is clear that (Chen et al., 2014) has the best separation, corresponding to the highest score in the WCR columns of Table 5. This can partially be explained by the fact that the construction of the embedding of (Chen et al., 2014) uses WordNet, which is also used in the construction of these tests. (R1 and R2 are constructed using a mixture of WordNet and human judgement; see (Arora et al., 2016).)

Figure 4: The distribution of cosine/Euclidean distances between (ir)relevant word and context cluster centers in the WCR R1 test for (Neelakantan et al., 2015) 300d_10s_1.32c_0mv embedding (top), (Huang et al., 2012) cossim embedding (middle), and (Chen et al., 2014) (bottom). The amount of separation between the two distributions seems correlated with the success of the embedding on the WCR tasks. In comparison, our method uses vector norms to distinguish relevance, the separation of which is shown in Figure 1.

Stanford’s Contextual Word Similarities (SCWS) The most popular existing test set for evaluating contextual word embedding similarity is the SCWS test (Huang et al., 2012), in which each test consists of two word-context pairs. Each word is always relevant to its own context, but the first word in its context may not be synonymous with the second word in its context. An example is given in Table 8. In our evaluation, we first prune the test set to only include words present in the vocabularies of all embeddings. Following (Huang et al., 2012), we sort all the test pairs by predicted similarity score and compare this ranking against the ground-truth ranking given by the average human evaluation score. The agreement between the two rankings is measured using the Spearman correlation score.
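For reference, the Spearman correlation between predicted similarities and human scores can be computed as the Pearson correlation of their ranks. The sketch below is a plain standard-library version with average ranks for ties; the function names and toy scores are hypothetical, and a library routine would normally be used instead.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of the rank vectors of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0.0 or sy == 0.0:
        return 0.0  # a constant ranking carries no ordering information
    return cov / (sx * sy)


predicted = [0.10, 0.40, 0.35, 0.80]  # toy model similarities
human = [1.0, 6.5, 2.3, 9.0]          # toy averaged human scores
rho = spearman(predicted, human)
```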

Example of SCWS test for admission and confession
… the reason the Buddha himself gave was that the admission of women would weaken the Sangha and shorten its lifetime …
… They included a confession said to have been inadvertently included on a computer disk that was given to the press…
avg. human-given sim. score: 2.3
Table 8: An example of a pair of word-contexts for a single SCWS task.

We note that in (Huang et al., 2012) the similarity between two word-context pairs is measured using avgSimC, a weighted average of cosine similarities between all possible representation vectors of the two words. This metric, however, cannot be applied to our approach since we have an infinite number of possible contextual representations for each word. Therefore, we use the cosine similarity without averaging, which is reasonable for all the embedding approaches. We note that the cosine similarity is also used in (Neelakantan et al., 2015; Reisinger and Mooney, 2010a). Of course, this is disadvantageous for the embeddings of (Huang et al., 2012), and our scores for their embedding are closer to those reported in (Neelakantan et al., 2015), which also does not use averaging.

Our Contextual Word Similarity (CWS) We expand upon the SCWS test by providing our own larger CWS test, constructed mostly in an unsupervised manner based on WordNet. We retrieve a set of multisense words and their senses from WordNet as in WCR R3, with contexts as the concatenation of the definition and all example sentences. The full list of tests is attached in the dataset folder, with an example in Table 9. Note that compared to the SCWS test set, the contexts are much shorter and less noisy.

For each query word-context pair, we attach a positive label to another pair if it shares the query's context and its word is marked by WordNet as similar to the query word. We collect as negative samples pairs with a different context, where the word is either the query word itself or a word that WordNet marks as similar to the query word in that other context. Given a query pair, the goal is to rank the similar (positive) pairs above the dissimilar (negative) ones in the context of the query. In all, we create a set of tests based on polysemous words. We calculate the cosine similarity between the contextual embeddings of the negative/positive samples and the query, and report scores in Table 5.
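Table 5 reports AUC and AP for this ranking task. The sketch below uses the common textbook definitions of both metrics, which we assume here for illustration since the exact evaluation script is not described; the function names and toy scores are hypothetical.

```python
from typing import Sequence


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a positive outscores a negative (ties count half)."""
    pos = [s for s, l in zip(scores, labels) if l]
    neg = [s for s, l in zip(scores, labels) if not l]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Mean of precision@k taken at the rank k of each positive item."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    hits, total = 0, 0.0
    for k, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            total += hits / k
    return total / hits if hits else 0.0


# Toy cosine similarities between a query and four candidates.
scores = [0.9, 0.8, 0.3, 0.1]
labels = [1, 0, 1, 0]  # 1 = similar (positive), 0 = dissimilar (negative)
test_auc = auc(scores, labels)
test_ap = average_precision(scores, labels)
```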

| Label | Word | Context |
|---|---|---|
| query | coach | sports charge training athlete team |
| True | manager | sports charge training athlete team |
| False | bus | vehicle carrying passengers public transport rode work |
| False | coach | carriage pulled horses driver |

Table 9: Example test in Contextual Word Similarity dataset.

Word-Sense Classification (WSC) Both WCR and CWS tasks are heavily based on WordNet, and offer an unfair advantage to multisense embeddings whose construction is also based on WordNet. In this sense, SCWS offers a more generalizable evaluation. Two additional sources of word-sense tests are (Koeling et al., 2005; Bartunov et al., 2015). Similar to the Word-Sense Induction (WSI) task provided by the same works, we devise a Word-Sense Classification (WSC) task, to predict the correct sense of a polysemous word in a given sentence or paragraph.

We construct the tests from sense-labeled pairs in (Koeling et al., 2005) (C1) and (Raganato et al., 2017) (C2) by merging all the training and test data and further removing rare senses with few example sentences. Some examples of the C1 dataset are provided in Table 10 (C2 is very similar). In total, C1 contains 39 words, 116 senses, and 11,064 lines. C2 is much bigger, containing 783 words, 5,188 senses, and 961,670 examples. Given that such large datasets were already available, we did not need to create our own. For each word we create train-test splits, train a nearest-neighbor multiclass classifier with Euclidean distance between contextual word embeddings, and report the mean classification accuracy (Acc) averaged across all words.
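A nearest-neighbor sense classifier of this kind can be sketched with the standard library alone. The number of neighbors `k = 3`, the function names, and the toy embeddings and sense labels below are assumptions made for illustration.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple


def knn_predict(
    train_x: Sequence[Sequence[float]],
    train_y: Sequence[str],
    query: Sequence[float],
    k: int = 3,
) -> str:
    """Majority sense label among the k training embeddings that are
    closest to the query in Euclidean distance."""
    by_distance = sorted((math.dist(x, query), y) for x, y in zip(train_x, train_y))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


def accuracy(train_x, train_y, test: List[Tuple[Sequence[float], str]], k: int = 3) -> float:
    """Fraction of test embeddings whose predicted sense is correct."""
    correct = sum(knn_predict(train_x, train_y, x, k) == y for x, y in test)
    return correct / len(test)


# Toy contextual embeddings of "coach" with made-up sense labels.
train_x = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]]
train_y = ["vehicle", "vehicle", "trainer", "trainer"]

pred = knn_predict(train_x, train_y, [0.2, 0.1])
acc = accuracy(train_x, train_y, [([0.2, 0.1], "vehicle"), ([5.5, 5.2], "vehicle")])
```

The per-word accuracies obtained this way would then be averaged across words to give the Acc columns.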

| Test | Sense | Word | Context |
|---|---|---|---|
| 1 | coach1:06:02:: | coach | It was perfect for low fare express coach services. |
| 1 | coach1:18:01:: | coach | But Chicago coach Phil Jackson said the bulls had done a good team job of preventing Malone from getting the easy transition baskets he thrives on nn even when the lumbering longley could not keep up with him. |
| 1 | coach1:06:02:: | coach | Police said special trains and coaches had already been booked in Belgium alone for Sunday’s march for jobs. |
| 2 | right1:07:00:: | right | The buyer can exercise this right by refusing to take delivery or informing the seller that he rejects the goods. |
| 2 | right10:15:00:: | right | Woods hit first and stuck his approach about feet to the right of the pin. |

Table 10: Example data (word, context sentence, and word-sense label) in Word-Sense Classification WSC-C1 dataset. The word-sense labels follow the format in WordNet.

Discussion We have provided a wide array of evaluations for measuring different aspects of multisense word embeddings, both collected from existing test sources and formed ourselves through WordNet. Overall, we find that our simple model performs surprisingly well on all evaluations, with the only consistent competitor being a WordNet-based model.

One thing to note is that the SCWS Spearman scores of (Huang et al., 2012) listed here are much smaller than those first reported. This is entirely attributed to the fact that we use direct cosine similarity between word embeddings, whereas they use an averaged similarity across their provided context words. Both are perfectly valid metrics; our choice is solely so that the identical metric can be applied to all embeddings, including those for which the averaged similarity metric cannot be used.

5 Conclusion

In this paper, we developed a method that can yield contextual word embeddings for any word under any context. When the context is irrelevant to the word, the embedding norms will be almost 0. A key highlight of our method is the simplicity, both from the modeling and the learning point of view. Experiments on several datasets and on several tasks show that the method we propose is competitive with the state of the art when it comes to unsupervised methods to learn polysemous word representations.