Inducing and Embedding Senses with Scaled Gumbel Softmax

04/22/2018 ∙ by Fenfei Guo, et al. ∙ University of Massachusetts Amherst ∙ University of Maryland

Methods for learning word sense embeddings represent a single word with multiple sense-specific vectors. These methods should not only produce interpretable sense embeddings, but should also learn how to select which sense to use in a given context. We propose an unsupervised model that learns sense embeddings using a modified Gumbel softmax function, which allows for differentiable discrete sense selection. Our model produces sense embeddings that are competitive (and sometimes state of the art) on multiple similarity-based downstream evaluations. However, performance on these downstream evaluation tasks does not correlate with interpretability of sense embeddings, as we discover through an interpretability comparison with competing multi-sense embeddings. While many previous approaches perform well on downstream evaluations, they do not produce interpretable embeddings and learn duplicated sense groups; our method achieves the best of both worlds.




1 Introduction

Many machine learning models for natural language processing applications represent words with embeddings, which are vectors of real-valued numbers. Popular word embedding models such as Word2Vec Mikolov et al. (2013a, b) and GloVe Pennington et al. (2014) have been instrumental in achieving state-of-the-art results on many NLP tasks, such as sentiment analysis Kim (2014); Tai et al. (2015) and textual entailment Parikh et al. (2016); Chen et al. (2017).

Despite their success, word embeddings are not perfect, particularly for polysemous words (those with multiple senses). For example, Word2Vec and GloVe learn a single embedding for each type, which in the case of polysemous words conflates many different and possibly unrelated meanings (e.g., “I squashed the bug with my shoe” vs. “I fixed the bug in my code”). To overcome this limitation and learn finer grained semantic clusters in the embedding space, word sense embedding models (Section 7) learn multiple representations for polysemous words, each corresponding to a specific meaning.

Unsupervised word sense embedding models have two key functions: (1) automatically inducing word senses from unlabeled data, and (2) learning sense-specific representations associated with the inferred sense. In this paper, we jointly learn both functions with an efficient, fully-differentiable model that obtains strong results on downstream tasks while also maintaining interpretability.

Our approach differs from prior work in that it implements discrete sense selection using differentiable hard attention. Existing discrete methods suffer from non-differentiability due to sense sampling Huang et al. (2012); Tian et al. (2014); Neelakantan et al. (2014); Qiu et al. (2011), which results in inefficient training and implementation complexity. On the other hand, soft selection methods with regularization Šuster et al. (2016) are unable to fully focus on one sense at a time during training. Closest to our own method, Lee and Chen (2017) implement discrete sense selection with hard attention through reinforcement learning (muse); however, crowdsourced interpretability evaluations reveal that their reinforcement learning approach does not actually learn distinct sense embeddings.

Concretely, we propose a fully-differentiable multi-sense extension of the Word2Vec skip-gram model (Section 2) that approximates discrete sense selection with a scaled variant of the Gumbel softmax function (Section 3). After qualitatively inspecting the nearest neighbors of competing approaches, we conclude that downstream task performance does not adequately measure a model’s ability to discover word senses, which motivates a human evaluation. Our model outperforms baseline systems on both types of evaluations (Section 5); the proposed Gumbel softmax variant is critical for balancing downstream performance with embedding interpretability.

2 Background: Word2Vec

Our sense embedding model extends the canonical Skip-Gram Word2Vec model with negative sampling Mikolov et al. (2013a, b). In this section, we provide a brief overview.

Word2Vec learns two sets of parameters, a word embedding matrix $W \in \mathbb{R}^{|V| \times d}$ and a context embedding matrix $C \in \mathbb{R}^{|V| \times d}$, where $V$ represents the vocabulary and $d$ is the dimension of the embedding space. Both embeddings, $W$ and $C$, are learned by maximizing the probability of the context words $c_j$ that surround a given pivot word $w_i$ in a context $\tilde{c}_i$,

$$J(W, C) \propto \sum_{w_i \in V} \sum_{c_j \in \tilde{c}_i} \log P(c_j \mid w_i; W, C). \tag{1}$$

In practice, $\log P(c_j \mid w_i)$ is usually approximated using negative sampling (as opposed to computing a softmax over the vocabulary), which greatly accelerates training.

Multi-sense variants

Equation 1 conflates all senses of a word into a single embedding. To address this problem, previous work modifies Word2Vec to learn multiple sense-specific embeddings per word. This sense induction component can be implemented by clustering context vectors Neelakantan et al. (2014) or by probabilistically modeling the contextual sense induction distribution directly Li and Jurafsky (2015); Šuster et al. (2016); Lee and Chen (2017). See Section 7 for more details.


Figure 1: Overview of gasi model. Given a pivot word (bond) and some context words (in blue), we first select which of the pivot word’s sense embeddings to use given the context via the scaled Gumbel softmax. Then, we follow the standard skip-gram objective with negative sampling to update the pivot and context embeddings.

3 Gumbel-Attention Sense Induction (gasi)

In contrast to previous work, our model discretely selects senses while retaining full differentiability. Lee and Chen (2017) show that non-differentiable discrete sense selection significantly outperforms all other methods on sense similarity related downstream tasks; however, their model (muse) does not learn interpretable sense embeddings for most words (Section 5). The sense embeddings learned by our model are more interpretable than those of prior approaches while also outperforming non-differentiable hard attention on downstream evaluations.

Here, we first provide an overview of our sense embedding framework and training objective, which can be instantiated with either soft or discrete sense selection. We then describe our discrete version, Gumbel Attention Sense Induction (gasi), which approximates hard attention over sense embeddings with a scaled Gumbel softmax function.

3.1 Attentional Sense Induction

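Like Word2Vec, we jointly learn two sets of parameters: the same context embedding matrix $C \in \mathbb{R}^{|V| \times d}$ as before (following previous work Neelakantan et al. (2014); Li and Jurafsky (2015), we do not consider senses of context words) and a sense embedding tensor $S \in \mathbb{R}^{|V| \times K \times d}$. Unlike previous approaches that maintain extra parameters for sense induction Neelakantan et al. (2014); Lee and Chen (2017), we infer the sense disambiguation distribution using only the embedding parameters. Each word is initialized with $K$ sense embeddings; after training, we perform a pruning post-processing step to remove duplicate and unused senses (Section 4.1).

Assuming pivot word $w_i$ has senses $s^i_1, \dots, s^i_K$, we expand the co-occurrence probability of $c_j$ given $w_i$ and local context $\tilde{c}_i$ as

$$P(c_j \mid w_i, \tilde{c}_i) = \sum_{k=1}^{K} P(c_j \mid s^i_k) \, P(s^i_k \mid w_i, \tilde{c}_i), \tag{2}$$

where $P(s^i_k \mid w_i, \tilde{c}_i)$ is computed through an attentional sense selection mechanism.

We use $\mathbf{s}^i_k$ to represent the embedding for pivot sense $s^i_k$ from the sense embedding tensor $S$, and $\mathbf{c}_j$ for the global embedding of context word $c_j$. Replacing $P(c_j \mid w_i)$ in Equation 1 with the sense expansion in Equation 2 gives

$$J(S, C) \propto \sum_{w_i \in V} \sum_{c_j \in \tilde{c}_i} \log \sum_{k=1}^{K} P(c_j \mid s^i_k) \, P(s^i_k \mid w_i, \tilde{c}_i). \tag{3}$$

The second term (the contextual sense induction distribution) is conditioned on the entire local context $\tilde{c}_i$. We can implement a baseline soft attention sense induction (sasi) model by computing this distribution with a softmax:

$$P(s^i_k \mid w_i, \tilde{c}_i) = \frac{\exp\left(\bar{\mathbf{c}}_i^\top \mathbf{s}^i_k\right)}{\sum_{k'=1}^{K} \exp\left(\bar{\mathbf{c}}_i^\top \mathbf{s}^i_{k'}\right)}, \tag{4}$$

where $\bar{\mathbf{c}}_i$ is the mean of the context vectors in $\tilde{c}_i$.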
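The soft attention (sasi) sense distribution of Equation 4 can be sketched in a few lines of Python. This is only an illustrative sketch: the function names and the toy two-dimensional vectors are assumptions made for the example, not the authors' implementation.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def soft_sense_distribution(sense_vectors, context_vectors):
    """Softmax over dot products between the mean context vector
    and each sense vector of the pivot word (soft attention)."""
    c_bar = mean_vector(context_vectors)
    return softmax([dot(c_bar, s) for s in sense_vectors])


# Toy example (made-up numbers): three senses, two context words.
senses = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
context = [[2.0, 0.0], [1.0, 1.0]]
probs = soft_sense_distribution(senses, context)
```

Here the mean context vector points mostly along the first axis, so the first sense receives the largest share of the probability mass, but every sense keeps a nonzero weight. That residual weight is the sense mixing that discrete selection is meant to avoid.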

Augmenting the Objective for Negative Sampling:

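To model the probability of the context given the pivot sense, $P(c_j \mid s^i_k)$, we follow its analogous term in Word2Vec (the probability of the context given the pivot word, $P(c_j \mid w_i)$) and compute it with a softmax over the vocabulary,

$$P(c_j \mid s^i_k) = \frac{\exp\left(\mathbf{c}_j^\top \mathbf{s}^i_k\right)}{\sum_{j'=1}^{|V|} \exp\left(\mathbf{c}_{j'}^\top \mathbf{s}^i_k\right)}. \tag{5}$$

Computing the softmax over the whole vocabulary is extremely time-consuming, so we would like to approximate the logarithm of the softmax with negative sampling. However, the sense induction term in Equation 3 marginalizes over possible senses inside the logarithm, so $\log P(c_j \mid s^i_k)$ does not appear explicitly in our objective function.

Given the concavity of the logarithm function, we can apply Jensen’s inequality,

$$\log \sum_{k=1}^{K} P(c_j \mid s^i_k) \, P(s^i_k \mid w_i, \tilde{c}_i) \ge \sum_{k=1}^{K} P(s^i_k \mid w_i, \tilde{c}_i) \, \log P(c_j \mid s^i_k), \tag{6}$$

to create a lower bound of the objective. Just as in variational inference Jordan et al. (1999), maximizing this lower bound gives us a tractable objective that we can optimize. The new objective function then becomes

$$J(S, C) \propto \sum_{w_i \in V} \sum_{c_j \in \tilde{c}_i} \sum_{k=1}^{K} P(s^i_k \mid w_i, \tilde{c}_i) \, \log P(c_j \mid s^i_k). \tag{7}$$

$P(s^i_k \mid w_i, \tilde{c}_i)$ is estimated by Equation 4 and $\log P(c_j \mid s^i_k)$ is estimated by the negative sampling term,

$$\log P(c_j \mid s^i_k) \approx \log \sigma\left(\mathbf{c}_j^\top \mathbf{s}^i_k\right) + \sum_{m=1}^{N} \mathbb{E}_{c_m \sim P_n(c)} \left[ \log \sigma\left(-\mathbf{c}_m^\top \mathbf{s}^i_k\right) \right], \tag{8}$$

where $\sigma$ is the logistic sigmoid, $P_n(c)$ is the negative sampling distribution, and $N$ is the number of negative samples drawn for each context-pivot pair.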
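As a rough illustration of how the lower bound and the negative sampling term fit together for a single pivot-context pair, consider the Python sketch below. It assumes that the negative context vectors have already been sampled, and all function names and numbers are hypothetical.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def log_sigmoid(x):
    """Numerically stable log(1 / (1 + exp(-x)))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def neg_sampling_log_prob(sense_vec, context_vec, negative_vecs):
    """Negative-sampling estimate of log P(context | sense), using the
    given sampled negatives in place of the expectation."""
    positive = log_sigmoid(dot(context_vec, sense_vec))
    negative = sum(log_sigmoid(-dot(n, sense_vec)) for n in negative_vecs)
    return positive + negative


def lower_bound(sense_probs, sense_vecs, context_vec, negative_vecs):
    """Jensen lower bound for one (pivot, context) pair:
    sum over senses of P(sense | context) * log P(context | sense)."""
    return sum(
        p * neg_sampling_log_prob(s, context_vec, negative_vecs)
        for p, s in zip(sense_probs, sense_vecs)
    )


# Toy example (made-up numbers): two senses, one observed context word,
# and two sampled negative context words.
sense_vecs = [[1.0, 0.0], [0.0, 1.0]]
context_vec = [2.0, 0.0]
negative_vecs = [[-1.0, 0.0], [0.0, -1.0]]
bound = lower_bound([0.9, 0.1], sense_vecs, context_vec, negative_vecs)
```

Because the sense weights sit outside the logarithm, each sense contributes its own negative-sampling term, and the weights decide how strongly each sense embedding is updated.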

3.2 Differentiable Discrete Sense Selection with Scaled Gumbel Softmax

In language, senses are not softly selected: with the exception of innuendo and jokes, most sentences use a single, discrete sense per word. Therefore, it is better to exploit each sense and capture as much semantic information as possible in each sense-specific embedding. In practice, most previous approaches Neelakantan et al. (2014); Li and Jurafsky (2015); Qiu et al. (2016) apply hard sampling and select one specific sense per training step. However, discrete selection in this manner is non-differentiable, which normally necessitates the use of policy gradient methods such as REINFORCE Williams (1992) that rarely work in practice without variance reduction techniques.

To preserve differentiability, we apply the Gumbel softmax  Jang et al. (2016); Maddison et al. (2016) to approximate hard attention. Observing that naïve application fails to learn interpretable sense vectors, we modify the Gumbel softmax by adding a scaling factor. This modification, which we call the scaled Gumbel softmax, is critical for learning interpretable embeddings that also perform well on downstream tasks.

3.2.1 Gumbel Softmax for Categorical Sampling

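The original Gumbel softmax approximates the sampling of discrete random variables. Given a discrete random variable $X$ with $P(X = k) \propto \alpha_k$, where $\alpha_k$ is unnormalized and $\alpha_k \in (0, \infty)$, the Gumbel-max trick Luce (1959); Yellott (1977); Papandreou and Yuille (2011); Hazan and Jaakkola (2012); Maddison et al. (2014) refactors the sampling of $X$ into a deterministic function of the parameters and independent noise,

$$X = \arg\max_{k} \left( \log \alpha_k + g_k \right), \tag{9}$$

where the Gumbel noise terms $g_k$ are i.i.d. samples drawn from Gumbel(0, 1), which can be sampled by drawing $u_k \sim$ Uniform(0, 1) and computing $g_k = -\log(-\log u_k)$.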
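The Gumbel-max trick is easy to check empirically. The following Python sketch draws Gumbel noise from uniform samples and confirms that the argmax of the perturbed log-weights follows the normalized weights; the weights and helper names are illustrative assumptions.

```python
import math
import random


def sample_gumbel(rng):
    """Draw one Gumbel(0, 1) sample as -log(-log(u)), u ~ Uniform(0, 1)."""
    u = rng.random()
    while u == 0.0:  # random() can return 0.0, where log is undefined
        u = rng.random()
    return -math.log(-math.log(u))


def gumbel_max_sample(alphas, rng):
    """Sample an index with probability proportional to alphas[k] by
    taking the argmax of log(alpha_k) plus independent Gumbel noise."""
    scores = [math.log(a) + sample_gumbel(rng) for a in alphas]
    return max(range(len(alphas)), key=lambda k: scores[k])


# Toy check (made-up unnormalized weights): empirical frequencies
# should approach alphas / sum(alphas) = [0.1, 0.2, 0.7].
rng = random.Random(0)
alphas = [1.0, 2.0, 7.0]
num_draws = 20000
counts = [0] * len(alphas)
for _ in range(num_draws):
    counts[gumbel_max_sample(alphas, rng)] += 1
freqs = [c / num_draws for c in counts]
```

All randomness lives in the noise terms, so the sample is a deterministic function of the weights once the noise is fixed. This is what makes a differentiable relaxation possible.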

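This categorical sampling can be approximated by the Gumbel softmax, which replaces the softmax in our sense disambiguation distribution (Equation 4). Concretely, with Gumbel noise $g_k$ added to each context-sense dot product, the argmax

$$\hat{k} = \arg\max_{k} \left( \bar{\mathbf{c}}_i^\top \mathbf{s}^i_k + g_k \right) \tag{10}$$

is approximated with

$$\gamma(s^i_k \mid \tilde{c}_i) = \frac{\exp\left( \left( \bar{\mathbf{c}}_i^\top \mathbf{s}^i_k + g_k \right) / \tau \right)}{\sum_{k'=1}^{K} \exp\left( \left( \bar{\mathbf{c}}_i^\top \mathbf{s}^i_{k'} + g_{k'} \right) / \tau \right)}, \tag{11}$$

where $\tau$ is the softmax temperature; as $\tau$ approaches zero, the distribution approaches a one-hot vector on $\hat{k}$.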

Figure 2: As the scale factor increases, the sense selection distribution for “bond” becomes flatter, which leads to increased sense mixing.
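The step from a hard argmax to its Gumbel softmax relaxation can be sketched in Python as follows. The temperature values, the logits, and the function names are assumptions chosen for the example; the noise is passed in explicitly so that the hard and relaxed selections can be compared on the same draw.

```python
import math
import random


def sample_gumbel(rng):
    """Draw one Gumbel(0, 1) sample as -log(-log(u)), u ~ Uniform(0, 1)."""
    u = rng.random()
    while u == 0.0:  # random() can return 0.0, where log is undefined
        u = rng.random()
    return -math.log(-math.log(u))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gumbel_softmax(logits, noise, temperature):
    """Relaxed one-hot vector: softmax of (logit + noise) / temperature."""
    return softmax([(l + g) / temperature for l, g in zip(logits, noise)])


rng = random.Random(0)
logits = [1.0, 0.0, -1.0]  # stand-ins for context-sense dot products
noise = [sample_gumbel(rng) for _ in logits]

# Hard (non-differentiable) choice and its relaxation on the same noise.
hard_choice = max(range(len(logits)), key=lambda k: logits[k] + noise[k])
relaxed = gumbel_softmax(logits, noise, temperature=0.1)

# With the noise switched off, a low temperature gives a nearly one-hot
# vector and a high temperature gives a nearly uniform one.
sharp = gumbel_softmax(logits, [0.0, 0.0, 0.0], temperature=0.1)
smooth = gumbel_softmax(logits, [0.0, 0.0, 0.0], temperature=10.0)
```

The relaxed vector always peaks at the same index as the hard choice, since dividing by a positive temperature does not change the ordering of the perturbed scores.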

3.2.2 Scaled Gumbel Softmax for sense disambiguation distribution inference

In practice, however, the vanilla Gumbel softmax learns flat sense distributions even with low temperatures; thus, it cannot learn sense disambiguation (bottom right of Figure 2). To solve this problem, we propose a simple variant of the Gumbel softmax that scales the Gumbel noise term based on the following empirical analysis.

Observing the flat distribution learned by Gumbel softmax (Figure 2), we monitor the value of the context-sense dot product $\bar{\mathbf{c}}_i^\top \mathbf{s}^i_k$, which we use to estimate the contextual sense induction distribution in Equation 11. The mean of this value converges quickly in the early stage of training, and the value ranges in a small window compared to the variance of the Gumbel noise term $g_k$ (Figure 3). (Empirically, given float32 precision, with the usage of , the softmax saturates and the gradients vanish at a fixed value of .) Therefore, $g_k$ dominates the estimation of the sense induction distribution after applying the Gumbel softmax, and this trend continues throughout training, which severely hampers learning of the sense disambiguation distribution.

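Given the above analysis, we mitigate the influence of the Gumbel noise in Equation 11 (Figure 3) with a scaling factor $\beta$, which we tune as a hyperparameter:

$$\gamma_\beta(s^i_k \mid \tilde{c}_i) = \frac{\exp\left( \left( \bar{\mathbf{c}}_i^\top \mathbf{s}^i_k + \beta g_k \right) / \tau \right)}{\sum_{k'=1}^{K} \exp\left( \left( \bar{\mathbf{c}}_i^\top \mathbf{s}^i_{k'} + \beta g_{k'} \right) / \tau \right)}, \tag{12}$$

where $g_k$ is the Gumbel noise and $\tau$ is the softmax temperature.

The final objective function for our model, Gumbel attention sense induction (gasi), becomes

$$J(S, C) \propto \sum_{w_i \in V} \sum_{c_j \in \tilde{c}_i} \sum_{k=1}^{K} \gamma_\beta(s^i_k \mid \tilde{c}_i) \, \log P(c_j \mid s^i_k). \tag{13}$$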


Figures 2 and 4 show that the scale factor balances the influence of the Gumbel noise and is critical for learning.
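The effect of the scale factor can be seen on a single fixed noise draw. The Python sketch below is a hedged illustration: the dot products, noise values, temperature, and function names are all made up for the example, and whether the authors' implementation applies the temperature in exactly this way is an assumption.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_gumbel_softmax(logits, noise, scale, temperature):
    """Gumbel softmax in which the noise is multiplied by a scale factor
    before being added to the logits."""
    return softmax(
        [(l + scale * g) / temperature for l, g in zip(logits, noise)]
    )


# Context-sense dot products in a narrow range, and one fixed
# (made-up) draw of noise with a much larger spread.
logits = [0.2, 0.0, -0.2]
noise = [-1.0, 2.0, 0.5]

vanilla = scaled_gumbel_softmax(logits, noise, scale=1.0, temperature=0.5)
scaled = scaled_gumbel_softmax(logits, noise, scale=0.05, temperature=0.5)
noise_free = scaled_gumbel_softmax(logits, noise, scale=0.0, temperature=0.5)
```

With a scale of 1.0 the selection follows the noise (index 1), while with a scale of 0.05 it follows the dot products (index 0). A scale of 0.0 removes the noise entirely and leaves a plain softmax over the dot products.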

Figure 3: Our attention mechanism is a function of the context-sense dot product $\bar{\mathbf{c}}_i^\top \mathbf{s}^i_k$ (Equation 11), whose mean and standard deviation are plotted here as a function of iteration. The shaded area shows that this term has a smaller scale than the Gumbel noise term $g_k$, such that $g_k$, rather than the embeddings, dominates the sense distribution. To correct this, we add a scaling term $\beta$ to the Gumbel noise $g_k$.

4 Training gasi

For fair comparisons, we try to remain consistent with previous work in all aspects of training. In particular, we train gasi on the same April 2010 Wikipedia snapshot Shaoul C. (2010) with one billion tokens used in previous work Huang et al. (2012); Neelakantan et al. (2014); Lee and Chen (2017). Additionally, we use the same vocabulary of 100k words and initialize our model with three senses per word. During training, we fix the window size to five and the dimensionality of the embedding space to 300.

We initialize both sense and context embeddings randomly within U(-0.5/dim, 0.5/dim) as in Word2Vec. We set the initial learning rate to 0.01; it is decreased linearly until training concludes after 5 epochs. The batch size is 512, and we use five negative samples per pivot / context pair as suggested by Mikolov et al. (2013a).
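For concreteness, the initialization range and the linear learning-rate decay can be sketched as below. The function names are hypothetical, and the sketch assumes that the learning rate decays all the way to zero at the final step, which the text does not state.

```python
import random


def init_embedding(dim, rng):
    """One embedding vector drawn from U(-0.5/dim, 0.5/dim)."""
    bound = 0.5 / dim
    return [rng.uniform(-bound, bound) for _ in range(dim)]


def linear_lr(step, total_steps, initial_lr=0.01):
    """Learning rate decayed linearly from initial_lr at step 0
    to zero at total_steps (the end point is an assumption)."""
    remaining = max(0.0, 1.0 - step / total_steps)
    return initial_lr * remaining


rng = random.Random(0)
vector = init_embedding(300, rng)
halfway_lr = linear_lr(step=50, total_steps=100)
```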

Our model is initialized with a fixed number of senses for all of the words in the vocabulary. For comparison to previous work Huang et al. (2012); Neelakantan et al. (2014); Lee and Chen (2017), we set the number of senses $K = 3$ for all experiments unless otherwise specified.

4.1 Pruning Duplicate Senses

Some previous work Neelakantan et al. (2014); Li and Jurafsky (2015) infers the number of senses for a given word dynamically during training, instead of learning a fixed number of senses per word. Rather than integrating this functionality into our training, we handle it post-training. For words that do not have multiple senses, or whose senses mostly appear with very low frequency in the corpus, our model (as well as many previous models) learns duplicate senses. We can easily remove such duplicates by pruning the learned sense embeddings with a threshold $\lambda$. Specifically, for each word $w_i$, if the cosine distance between any two of its sense embeddings ($\mathbf{s}^i_m$, $\mathbf{s}^i_n$) is smaller than $\lambda$, we consider them to be duplicates. After discovering all duplicate pairs, we start pruning with the sense that has the most duplicates and keep pruning with the same strategy until no more duplicates remain.
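The greedy pruning procedure described above can be sketched in Python as follows. This is an illustrative reading of the description rather than the authors' code: the function names are hypothetical, the tie-breaking rule (remove the higher-indexed sense first) is an assumption, and sense vectors are assumed to be nonzero.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity of two nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def prune_duplicate_senses(sense_vectors, threshold):
    """Return the indices of the senses that survive pruning.

    Two senses are duplicates if their cosine distance is below the
    threshold. The sense with the most duplicates is removed first,
    and the process repeats until no duplicate pair remains."""
    kept = list(range(len(sense_vectors)))

    def duplicate_count(k):
        return sum(
            1
            for j in kept
            if j != k
            and cosine_distance(sense_vectors[k], sense_vectors[j]) < threshold
        )

    while len(kept) > 1:
        counts = {k: duplicate_count(k) for k in kept}
        # Assumed tie-break: among equally duplicated senses, drop the
        # one with the highest index.
        worst = max(kept, key=lambda k: (counts[k], k))
        if counts[worst] == 0:
            break
        kept.remove(worst)
    return kept


# Toy example (made-up vectors): senses 0 and 1 are near-duplicates,
# sense 2 is distinct.
toy_senses = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]
survivors = prune_duplicate_senses(toy_senses, threshold=0.1)
```

In the toy example, senses 0 and 1 are within the threshold of each other, so one of them is dropped and the distinct sense 2 is kept.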

Model-specific pruning:

Since we would like to apply pruning not only to our model but also to others, we propose a model-agnostic strategy to estimate the threshold $\lambda$. We first sample 100 words from the negative sampling distribution over the vocabulary. Then, we retrieve the five nearest neighbors (from all senses of all words) to each sense of each sampled word. If one of a word’s own senses appears as a nearest neighbor, we append the distance between them to a sense duplication list. For other nearest neighbors, we append their distances to the word neighbor list. After populating the two lists, we want to choose a threshold that would prune away all of the sense duplicates while differentiating sense duplicates from other distinct neighbor words. Thus, we compute


The post-hoc analysis with human evaluation (Table 5) and the post-pruning word sense histogram (Figure 4) corroborate its effectiveness. This pruning only slightly reduces downstream performance on the word similarity task (Table 1, bottom).

Figure 4: Histogram of number of senses left per word after pruning. More multi-sense words are discovered by a smaller gasi scale factor, and muse fails to uncover distinct senses.

5 Downstream Evaluation

All prior work on word sense embeddings has evaluated embedding quality using downstream evaluations, namely semantic word similarity and synonym selection. We evaluate on both of these tasks and show competitive or state-of-the-art results when compared with baseline models. We also evaluate our model on word sense disambiguation tasks to show that our model learns a reasonable sense selection mechanism. With that said, inspecting the nearest neighbors of sense embeddings learned by some of the highest-scoring models on these tasks (e.g., gasi-1, muse) reveals that they do not learn distinct sense embeddings. In Section 6, we design a crowdsourced evaluation to measure sense interpretability, which shows that properly-scaled gasi models (along with soft attention variants) learn far more distinct word senses than muse Lee and Chen (2017).

5.1 Word similarity

To examine how well the learned sense embeddings capture semantic similarities, we evaluate our model on the Stanford Contextual Word Similarities (SCWS) dataset Huang et al. (2012), which contains 2003 word pairs with contexts and a human-rated similarity score for each pair.

We measure how close the model’s estimation of word similarity is to the human judgment by computing Spearman’s correlation $\rho$. Word similarity by sense embeddings is computed in three different ways Reisinger and Mooney (2010):

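  • MaxSimC: cosine similarity between the two most probable senses, i.e., the senses $s^1_{k_1}$ and $s^2_{k_2}$ that maximize the sense induction distribution of each word given its context,

$$\text{MaxSimC}(w_1, w_2) = \cos\left(\mathbf{s}^1_{k_1}, \mathbf{s}^2_{k_2}\right), \quad k_m = \arg\max_{k} P(s^m_k \mid w_m, \tilde{c}_m). \tag{14}$$

  • AvgSim: averaged cosine similarity over the combinations of all senses for a given word pair,

$$\text{AvgSim}(w_1, w_2) = \frac{1}{K^2} \sum_{k=1}^{K} \sum_{l=1}^{K} \cos\left(\mathbf{s}^1_k, \mathbf{s}^2_l\right). \tag{15}$$

  • AvgSimC: weighted average similarity over the combinations of all senses for a given word pair,

$$\text{AvgSimC}(w_1, w_2) = \sum_{k=1}^{K} \sum_{l=1}^{K} P(s^1_k \mid w_1, \tilde{c}_1) \, P(s^2_l \mid w_2, \tilde{c}_2) \, \cos\left(\mathbf{s}^1_k, \mathbf{s}^2_l\right). \tag{16}$$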
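The three similarity measures can be sketched in Python as below. The function names and toy vectors are assumptions for the example, and the AvgSimC variant shown weights each sense pair by the product of the two sense probabilities without an extra normalization constant.

```python
import math


def cosine(u, v):
    """Cosine similarity of two nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def max_sim_c(senses1, probs1, senses2, probs2):
    """Cosine similarity between the most probable sense of each word."""
    k1 = max(range(len(probs1)), key=lambda k: probs1[k])
    k2 = max(range(len(probs2)), key=lambda k: probs2[k])
    return cosine(senses1[k1], senses2[k2])


def avg_sim(senses1, senses2):
    """Unweighted mean cosine similarity over all sense pairs."""
    total = sum(cosine(a, b) for a in senses1 for b in senses2)
    return total / (len(senses1) * len(senses2))


def avg_sim_c(senses1, probs1, senses2, probs2):
    """Mean cosine similarity over all sense pairs, weighted by the
    contextual probability of each sense."""
    return sum(
        p * q * cosine(a, b)
        for a, p in zip(senses1, probs1)
        for b, q in zip(senses2, probs2)
    )


# Toy example (made-up vectors and probabilities).
senses_a = [[1.0, 0.0], [0.0, 1.0]]
senses_b = [[1.0, 0.0], [-1.0, 0.0]]
peaked_a, peaked_b = [0.9, 0.1], [0.2, 0.8]
flat = [0.5, 0.5]

hard = max_sim_c(senses_a, peaked_a, senses_b, peaked_b)
weighted = avg_sim_c(senses_a, peaked_a, senses_b, peaked_b)
flat_weighted = avg_sim_c(senses_a, flat, senses_b, flat)
unweighted = avg_sim(senses_a, senses_b)
```

With a flat sense distribution, the weighted score collapses to the unweighted one, so AvgSimC only reflects sense selection when the distribution is peaked.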

| Model | MaxSimC | AvgSim | AvgSimC |
|---|---|---|---|
| Huang-50d | 26.1 | 62.8 | 65.7 |
| Neelakantan | 59.3 | 67.2 | 69.2 |
| Neelakantan-NP | 60.1 | 67 | 68.6 |
| Tian | 63.6 | - | 65.4 |
| Li | 66.6 | - | 66.8 |
| Qiu | 64.9 | - | 66.1 |
| MUSE-Boltzmann | 67.9 | 68.7 | 68.7 |
| sasi | 55.1 | 64.8 | 67.8 |
| gasi-0.2 | 56.5 | 65.3 | 68.2 |
| gasi-0.4 | 65.1 | 66.3 | 69.3 |
| gasi-0.5 | 67.2 | 66.5 | 69.1 |
| gasi-1.0 | 68.2 | 67.8 | 68.3 |
| gasi-0.4-pruned | 64.6 | 63.7 | 68.9 |
| gasi-0.5-pruned | 66.7 | 64.2 | 68.6 |

Table 1: Spearman’s rank correlation on the SCWS word similarity dataset. The indicates models that support a variable number of senses per word. gasi obtains competitive or state-of-the-art performance on all metrics.

We compare our model with previous work in Table 1; the aforementioned muse Lee and Chen (2017) achieved the previous state of the art on this task. Baseline models are explained in more detail in Section 7. gasi outperforms sasi and all baselines other than muse on all three metrics. muse achieves a higher AvgSim score than all gasi variants (this value is computed with the learned embeddings and code released by the authors); however, a high AvgSim can be achieved by a completely flat sense induction distribution (Equations 15 and 16), as we show in Section 6.

By varying the scale factor from 0.0 (which reduces to sasi) to 1.0, we can see a consistent increase in MaxSimC, while AvgSimC drops after 0.4. Taken together with the qualitative example in Figure 2, these results suggest that the scale factor controls both the number of distinct senses learned and the representational quality (measured by word similarity).

5.2 Synonym selection

Synonym selection is another common evaluation for word/sense representations Jauhar et al. (2015a); Ettinger et al. (2016); Lee and Chen (2017): ESL-50 Turney (2001), RD-300 Jarmasz and Szpakowicz (2004), and TOEFL-80 each consist of questions with a target word and four candidates. These datasets do not provide contexts for the target word. Without contexts, we follow Jauhar et al. (2015a) and compute the cosine similarity between the senses of each candidate and those of the target word. Then, we select the synonym whose sense has the maximum similarity to any of the target senses.
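A context-free synonym selection rule of this kind can be sketched in Python as follows; the candidate labels, vectors, and function names are made up for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def select_synonym(target_senses, candidate_senses):
    """Pick the candidate whose best sense is most similar to any
    sense of the target word.

    candidate_senses maps each candidate word to its sense vectors."""

    def best_similarity(word):
        return max(
            cosine(t, s)
            for t in target_senses
            for s in candidate_senses[word]
        )

    return max(candidate_senses, key=best_similarity)


# Toy question (made-up vectors): one target word with two senses
# and four candidates.
target = [[1.0, 0.0], [0.0, 1.0]]
candidates = {
    "cand_a": [[-1.0, 0.0]],
    "cand_b": [[0.6, 0.8], [-1.0, -1.0]],
    "cand_c": [[0.9, 0.1]],
    "cand_d": [[0.0, -1.0]],
}
answer = select_synonym(target, candidates)
```

Taking the maximum over all sense pairs means a candidate only needs to match one sense of the target, which suits datasets that give no context to disambiguate with.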

| Model | ESL-50 | RD-300 | TOEFL-80 |
|---|---|---|---|
| Unsupervised monolingual sense embeddings | | | |
| Li | 50.00 | 55.36 | 82.61 |
| Neelakantan | 57.14 | 58.93 | 78.26 |
| MUSE-Boltzmann | 64.29 | 66.07 | 88.41 |
| gasi-0.4 | 63.36 | 67.27 | 86.69 |
| Retrofitting on ontologies or parallel texts | | | |
| Retro-GC | 63.64 | 66.20 | 71.01 |
| Retro-SG | 56.25 | 65.09 | 73.33 |
| Parallel Text (PD) | 66.7 | 74.7 | 81.8 |
| WordNet (WN) | 68.8 | 62.1 | 80.5 |
| PD-WN | 70.8 | 79.3 | 84.4 |

Table 2: Synonym selection accuracy with different embedding models; gasi and muse again achieve similarly high scores.

We compare our results with other multi-sense embeddings in Table 2. Although our unsupervised model cannot compete on ESL-50 and RD-300 with models that retrofit on external resources Jauhar et al. (2015b); Ettinger et al. (2016), it achieves the best performance on RD-300 among unsupervised sense embeddings and is competitive with muse on the other two datasets.

5.3 Word sense disambiguation

We further compare our model with two baselines Neelakantan et al. (2014); Lee and Chen (2017) on four word sense disambiguation (WSD) test sets from the Senseval/SemEval series: Senseval-2 Edmonds and Cotton (2001), Senseval-3 Snyder and Palmer (2004), SemEval-2013 Navigli et al. (2013), and SemEval-2015 Moro and Navigli (2015). We train on SemCor 3.0 Miller et al. (1994), which contains sentences that have multiple words to be disambiguated.

WSD based on contextual sense induction

To focus our evaluation on the word sense disambiguation captured by the sense induction component of each model, we map each sense to a set of synsets given its part-of-speech (POS) tag. Specifically, for each (sense, synset, POS) tuple for a given word type, we accumulate the probability for all tokens that are assigned to the synset:


Then, we average over all possible synsets. At test time, we assign each instance of the target word given its POS tag to the synset that maximizes


We apply the same methodology to all sense embedding models (Table 3). Although our results are worse than those of a state-of-the-art WSD system, IMS+emb Iacobacci et al. (2016), gasi achieves a higher F1 score than either of the baselines.

| Model | Noun | Verb | Adj | Adv | All |
|---|---|---|---|---|---|
| WSD based on contextual sense induction distribution | | | | | |
| Neelakantan | 68.6 | 50.3 | 73.6 | 81.0 | 66.4 |
| MUSE-Boltzmann | 68.5 | 49.7 | 73.3 | 81.0 | 66.1 |
| gasi-0.4 | 69.5 | 50.6 | 74.0 | 81.0 | 67.1 |
| State-of-the-art WSD system | | | | | |
| IMS + emb | 71.9 | 56.6 | 75.9 | 84.7 | 70.1 |

Table 3: F1 score on all four WSD tasks; gasi outperforms other word sense embedding baselines.

6 Judging Interpretability via Crowdsourcing

| Word | Nearest Neighbors |
|---|---|
| Nearest Neighbors by GASI-0.4 | |
| foot 1 | knee, kneecap, forearm, leg, toe, heel, cheekbone |
| foot 2 | 50-foot, meters, paces, meter, 40-foot, six-foot |
| foot 3 | grenadier, fusilier, rifles, corunna, colonelcy |
| bond 1 | 007, octopussy, brosnan, thunderball |
| bond 2 | cdo, securities, refinance, surety, debenture |
| bond 3 | atom, carbons, molecule, covalent, substituent |
| Nearest Neighbors by MUSE-Boltzmann | |
| foot 1 | meters, shoulder, toe, metres, six-foot, side |
| foot 2 | six-foot, knee, surmounting, toe, meters, leg, ankle |
| foot 3 | metre, meters, six-foot, shoulder, toes, ft, paces |
| bond 1 | thunderball, goldfinger, octopussy, moonraker |
| bond 2 | thunderball, moonraker, octopussy, blofeld |
| bond 3 | thunderball, octopussy, moonraker, goldfinger |

Table 4: Comparison of nearest neighbors learned by gasi-0.4 (top) and MUSE-Boltzmann (bottom).

The previous section shows that gasi achieves state-of-the-art or competitive results on downstream tasks. However, these results do not tell us much about the information captured by each sense embedding, or how many distinct senses the model learns per word. One way to interpret the learned sense embeddings is by looking at the nearest neighbors for a sample of words. From Table 4, we can see that gasi is able to learn meaningful distinct sense groups for each word. In contrast, MUSE-Boltzmann Lee and Chen (2017) learns near-duplicate senses for many examples (e.g., Table 4). To provide a quantitative interpretability evaluation at a larger scale, we design a crowdsourcing task that measures how many times the sense chosen by a model based on its contextual sense distribution agrees with human judgments.

Task description

For a given target word in a given context, we ask a worker to select the one sense group, among the three learned by the model, that best fits the given sentence. Each sense group is described by its top-10 distinct nearest neighbors; an example is shown in Figure 5.

Data collection

We select a subset of nouns from SemCor 3.0 Miller et al. (1994) to use for this task. In particular, we first filter out all synsets in the dataset that have fewer than ten sentences, then rank the remaining nouns by their number of synsets and select the top 50, randomly selecting five sentences per word for the task. For each embedding model, we obtain three annotations on the sentence / noun pairs using the Crowdflower platform.

Figure 5: User interface with an example question for the sense agreement crowdsourcing task.

Analysis of results

In the first block of Table 5, we see that sasi achieves the highest accuracy, followed by gasi-0.4. Both are higher than the random baseline of one in three (33.3%), unlike muse. We apply Fleiss’ $\kappa$ Fleiss (1971) to measure the inter-rater reliability (IRR) of our multiple-choice task. The IRR numbers reported in Table 5 are very low for most of the models. There are two possible causes: (1) the model failed to yield interpretable sense groups, and/or (2) the model learned duplicate senses.
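For readers unfamiliar with the statistic, Fleiss' kappa for a multiple-choice task with a fixed number of annotators per item can be computed as in the Python sketch below. The function name and the toy rating counts are assumptions for the example and are not data from this study.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table where counts[i][j] is the number of
    raters who assigned item i to category j.

    Every item must be rated by the same number of raters (at least
    two), and the ratings must not all fall in a single category."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number of raters")

    # Observed agreement: fraction of agreeing rater pairs per item.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    observed = sum(per_item) / n_items

    # Chance agreement from the overall category proportions.
    n_categories = len(counts[0])
    proportions = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    expected = sum(p * p for p in proportions)

    return (observed - expected) / (1.0 - expected)


# Toy tables (made-up): three raters choosing among three sense groups.
perfect = fleiss_kappa([[3, 0, 0], [0, 3, 0], [0, 0, 3]])
split = fleiss_kappa([[1, 1, 1], [1, 1, 1]])
```

Full agreement across different categories gives a kappa of 1, while raters who always split evenly across the three options score below zero, that is, worse than chance agreement.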

To isolate the cause, we apply a post-hoc pruning analysis. To be more specific, we prune each model’s learned embeddings with the strategy described in Section 4.1 and then re-assign the user’s choice to its nearest neighbor sense. The second block in Table 5 shows that after pruning, both the accuracy and IRR score increase significantly for the gasi models. This result demonstrates the efficacy of our pruning method. We also observe that pruning does not help sasi or the MSSG model Neelakantan et al. (2014) since very few senses were pruned; in contrast, almost all of muse’s senses are pruned, but the human accuracy actually decreases. (We do not consider words pruned to a single sense in the accuracy computation.)

7 Related Work

Our work adds to the existing body of research on learning unsupervised word sense embeddings. In this section, we compare and contrast these previous methods to gasi.

Reisinger and Mooney (2010) were the first to propose a multi-sense semantic vector-space model. Several variants of this idea (including gasi) were later implemented as extensions of Word2Vec Mikolov et al. (2013a, b). Each of these induces senses using one of three techniques:

  1. Supervised methods trained on annotated sense corpora Iacobacci et al. (2015) or external sense inventories and knowledge bases like WordNet Chen et al. (2014); Jauhar et al. (2015b); Chen et al. (2015) and Wikipedia Wu and Giles (2015);

  2. Bilingual sense induction from multilingual parallel corpora Guo et al. (2014); Šuster et al. (2016); Ettinger et al. (2016);

  3. Unsupervised monolingual models that attempt to induce senses using various methods, such as context clustering Huang et al. (2012); Neelakantan et al. (2014); Li and Jurafsky (2015), corpus-level probability estimation Tian et al. (2014), context-based energy functions Qiu et al. (2016), and reinforcement learning Lee and Chen (2017). gasi falls into this category.

| Model | Accuracy | $P$ | Fleiss’ $\kappa$ |
|---|---|---|---|
| MUSE | 28 | 0.33 | 0.13 |
| Neelakantan | 44 | 0.37 | 0.24 |
| sasi | 56.4 | 0.41 | 0.30 |
| gasi-0.4 | 48 | 0.40 | 0.18 |
| gasi-0.5 | 40 | 0.36 | 0.18 |
| Post-hoc analysis after pruning (parentheses show the number of instances left) | | | |
| MUSE (75) | 26.6 | 0.20 | 0.13 |
| Neelakantan (245) | 44.5 | 0.33 | 0.24 |
| sasi (250) | 55.6 | 0.42 | 0.30 |
| gasi-0.4 (185) | 69.7 | 0.41 | 0.43 |
| gasi-0.5 (125) | 73.6 | 0.40 | 0.48 |

Table 5: Human evaluation results on sense agreement, where $P$ is the average probability assigned by the model to the human choices. The pruned gasi achieves the best performance, while muse does worse than random.

These methods also differ in how they disambiguate senses in context. Most previous approaches rely on discrete sampling based on the sense induction distribution (or computing the argmax), which loses model differentiability. Šuster et al. (2016) maintain differentiability by using soft attention, but due in part to sense mixing, their monolingual version performs poorly on downstream tasks. Lee and Chen (2017) try to address this problem by applying hard attention for discrete sense selection with reinforcement learning. While this approach achieves high scores on downstream evaluation tasks, we show that it does not learn distinct interpretable sense embeddings in Section 5. gasi accomplishes the best of both worlds, avoiding sense mixing through hard, differentiable attention while also achieving high interpretability.

8 Conclusion

In this paper, we propose to learn word sense embeddings through Gumbel attention sense induction (gasi). Our model applies differentiable hard attention to simultaneously induce and embed word senses from an unlabeled corpus. We introduce a scaling factor to the Gumbel softmax that allows gasi to learn sense disambiguation and achieve competitive or state-of-the-art performance on similarity-based downstream evaluations. Furthermore, we show that performance on these evaluation tasks does not necessarily correlate with increased interpretability. Motivated by this observation, we design a human evaluation task to quantitatively measure how well the model’s sense selection mechanism agrees with that of humans, on which gasi performs better than competing approaches.


References

  • Chen et al. (2017) Qian Chen, Xiaodan Zhu, Zhenhua Ling, Si Wei, Hui Jiang, and Diana Inkpen. 2017. Enhanced LSTM for natural language inference. In Proceedings of the Association for Computational Linguistics.
  • Chen et al. (2015) Tao Chen, Ruifeng Xu, Yulan He, and Xuan Wang. 2015. Improving distributed representation of word sense via WordNet gloss composition and context clustering. In Proceedings of ACL-IJCNLP. Association for Computational Linguistics, pages 15–20.
  • Chen et al. (2014) Xinxiong Chen, Zhiyuan Liu, and Maosong Sun. 2014. A unified model for word sense representation and disambiguation. In Proceedings of EMNLP. Citeseer.
  • Edmonds and Cotton (2001) Philip Edmonds and Scott Cotton. 2001. Senseval-2: overview. In The Proceedings of the Second International Workshop on Evaluating Word Sense Disambiguation Systems.
  • Ettinger et al. (2016) Allyson Ettinger, Philip Resnik, and Marine Carpuat. 2016. Retrofitting sense-specific word vectors using parallel text. In Proceedings of NAACL. pages 1378–1383.
  • Fleiss (1971) Joseph L Fleiss. 1971. Measuring nominal scale agreement among many raters. Psychological bulletin 76(5):378.
  • Guo et al. (2014) Jiang Guo, Wanxiang Che, Haifeng Wang, and Ting Liu. 2014. Learning sense-specific word embeddings by exploiting bilingual resources. In COLING. pages 497–507.
  • Hazan and Jaakkola (2012) Tamir Hazan and Tommi Jaakkola. 2012. On the partition function and random maximum a-posteriori perturbations. In Proceedings of ICML.
  • Huang et al. (2012) Eric H Huang, Richard Socher, Christopher D Manning, and Andrew Y Ng. 2012. Improving word representations via global context and multiple word prototypes. In Proceedings of ACL.
  • Iacobacci et al. (2016) Ignacio Iacobacci, Mohammad Taher Pilehvar, and Roberto Navigli. 2016. Embeddings for word sense disambiguation: An evaluation study. In Proceedings of ACL.
  • Iacobacci et al. (2015) Ignacio Iacobacci, Taher Mohammad Pilehvar, and Roberto Navigli. 2015. Sensembed: Learning sense embeddings for word and relational similarity. In Proceedings of ACL-IJCNLP. Association for Computational Linguistics, pages 95–105.
  • Jang et al. (2016) Eric Jang, Shixiang Gu, and Ben Poole. 2016. Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144 .
  • Jarmasz and Szpakowicz (2004) Mario Jarmasz and Stan Szpakowicz. 2004. Roget’s thesaurus and semantic similarity. In Recent Advances in Natural Language Processing III: Selected Papers from RANLP.
  • Jauhar et al. (2015a) Sujay Kumar Jauhar, Chris Dyer, and Eduard Hovy. 2015a. Ontologically grounded multi-sense representation learning for semantic vector space models. In Conference of the North American Chapter of the Association for Computational Linguistics.
  • Jauhar et al. (2015b) Sujay Kumar Jauhar, Chris Dyer, and Eduard Hovy. 2015b. Ontologically grounded multi-sense representation learning for semantic vector space models. In Proc. NAACL.
  • Jordan et al. (1999) Michael I. Jordan, Zoubin Ghahramani, Tommi S. Jaakkola, and Lawrence K. Saul. 1999. An introduction to variational methods for graphical models. Machine Learning 37(2):183–233.
  • Kim (2014) Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of Empirical Methods in Natural Language Processing.
  • Lee and Chen (2017) Guang-He Lee and Yun-Nung Chen. 2017. MUSE: Modularizing unsupervised sense embeddings. In Proceedings of EMNLP.
  • Li and Jurafsky (2015) Jiwei Li and Dan Jurafsky. 2015. Do multi-sense embeddings improve natural language understanding? arXiv preprint arXiv:1506.01070.
  • Luce (1959) R Duncan Luce. 1959. Individual Choice Behavior: A Theoretical Analysis. New York: Wiley.
  • Maddison et al. (2016) Chris J Maddison, Andriy Mnih, and Yee Whye Teh. 2016. The concrete distribution: A continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712.
  • Maddison et al. (2014) Chris J Maddison, Daniel Tarlow, and Tom Minka. 2014. A* sampling. In Proceedings of NIPS.
  • Mikolov et al. (2013a) Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
  • Mikolov et al. (2013b) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Proceedings of NIPS.
  • Miller et al. (1994) George A Miller, Martin Chodorow, Shari Landes, Claudia Leacock, and Robert G Thomas. 1994. Using a semantic concordance for sense identification. In Proceedings of the workshop on Human Language Technology.
  • Moro and Navigli (2015) Andrea Moro and Roberto Navigli. 2015. SemEval-2015 task 13: Multilingual all-words sense disambiguation and entity linking. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015).
  • Navigli et al. (2013) Roberto Navigli, David Jurgens, and Daniele Vannella. 2013. SemEval-2013 task 12: Multilingual word sense disambiguation. In Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013).
  • Neelakantan et al. (2014) Arvind Neelakantan, Jeevan Shankar, Alexandre Passos, and Andrew McCallum. 2014. Efficient non-parametric estimation of multiple embeddings per word in vector space. In Proceedings of EMNLP.
  • Papandreou and Yuille (2011) George Papandreou and Alan L Yuille. 2011. Perturb-and-MAP random fields: Using discrete optimization to learn and sample from energy models. In Proceedings of ICCV.
  • Parikh et al. (2016) Ankur Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. 2016. A decomposable attention model for natural language inference. In Proceedings of Empirical Methods in Natural Language Processing.
  • Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher D Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of EMNLP.
  • Qiu et al. (2011) Guang Qiu, Bing Liu, Jiajun Bu, and Chun Chen. 2011. Opinion word expansion and target extraction through double propagation. Computational Linguistics 37(1):9–27.
  • Qiu et al. (2016) Lin Qiu, Kewei Tu, and Yong Yu. 2016. Context-dependent sense embedding. In Proceedings of EMNLP.
  • Reisinger and Mooney (2010) Joseph Reisinger and Raymond J. Mooney. 2010. Multi-prototype vector-space models of word meaning. In Proceedings of NAACL.
  • Shaoul C. (2010) C. Shaoul and C. Westbury. 2010. The Westbury Lab Wikipedia Corpus.
  • Snyder and Palmer (2004) Benjamin Snyder and Martha Palmer. 2004. The English all-words task. In Proceedings of SENSEVAL-3, the Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text.
  • Šuster et al. (2016) Simon Šuster, Ivan Titov, and Gertjan van Noord. 2016. Bilingual learning of multi-sense embeddings with discrete autoencoders. In Proceedings of NAACL.
  • Tai et al. (2015) Kai Sheng Tai, Richard Socher, and Christopher D. Manning. 2015. Improved semantic representations from tree-structured long short-term memory networks. In Proceedings of the Association for Computational Linguistics.
  • Tian et al. (2014) Fei Tian, Hanjun Dai, Jiang Bian, Bin Gao, Rui Zhang, Enhong Chen, and Tie-Yan Liu. 2014. A probabilistic model for learning multi-prototype word embeddings. In Proceedings of COLING. pages 151–160.
  • Turney (2001) Peter D Turney. 2001. Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In European Conference on Machine Learning.
  • Williams (1992) Ronald J Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8(3-4).
  • Wu and Giles (2015) Zhaohui Wu and C Lee Giles. 2015. Sense-aware semantic analysis: A multi-prototype word representation model using Wikipedia. In AAAI. Citeseer, pages 2188–2194.
  • Yellott (1977) John I Yellott. 1977. The relationship between Luce’s choice axiom, Thurstone’s theory of comparative judgment, and the double exponential distribution. Journal of Mathematical Psychology 15(2):109–144.