MUSE: Modularizing Unsupervised Sense Embeddings

04/15/2017 ∙ by Guang-He Lee, et al. ∙ MIT IEEE

This paper proposes to address the word sense ambiguity issue in an unsupervised manner, where word sense representations are learned along with a word sense selection mechanism given contexts. Prior work on learning multi-sense embeddings suffered from either ambiguity of different-level embeddings or inefficient sense selection. The proposed modular framework, MUSE, implements flexible modules to optimize distinct mechanisms, achieving the first purely sense-level representation learning system with linear-time sense selection. We leverage reinforcement learning to enable joint training of the proposed modules, and introduce various exploration techniques on sense selection for better robustness. Experiments on benchmark data show that the proposed approach achieves state-of-the-art performance on synonym selection as well as on contextual word similarities in terms of MaxSimC.




1 Introduction

Recently, deep learning methodologies have dominated several research areas in natural language processing (NLP), such as machine translation, language understanding, and dialogue systems. However, most applications rely on word-level embeddings to capture semantics. Because natural language is highly ambiguous, standard word embeddings may suffer from polysemy issues.

Neelakantan et al. (2014) pointed out that, due to the triangle inequality in vector space, if one word has two different senses but is restricted to one embedding, the sum of the distances between the word and its synonym in each sense would upper-bound the distance between the respective synonyms, which may be mutually irrelevant, in embedding space. Due to the theoretical inability to account for polysemy using a single embedding representation per word, multi-sense word representations are proposed to address the ambiguity issue using multiple embedding representations for different senses of a word (Reisinger and Mooney, 2010; Huang et al., 2012).
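The bound in this argument can be written out explicitly. The notation here is an assumption made for illustration and does not come from the paper: $w$ is the single embedding of the ambiguous word, $s_1$ and $s_2$ are the embeddings of its synonyms under the two senses, and $d$ is any distance metric on the embedding space.

```latex
d(s_1, s_2) \;\le\; d(s_1, w) + d(w, s_2)
```

For instance, if training pulls both synonyms close to the shared embedding so that $d(s_1, w) = 0.2$ and $d(w, s_2) = 0.3$, then $d(s_1, s_2) \le 0.5$, even when $s_1$ and $s_2$ are semantically unrelated.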

This paper focuses on unsupervised learning from an unannotated corpus. There are two key mechanisms for a multi-sense word representation system in such a scenario: 1) a sense selection (decoding) mechanism infers the most probable sense for a word given its context and 2) a sense representation mechanism learns to embed word senses in a continuous space.

Under this framework, prior work focused on designing a single model to deliver both mechanisms (Neelakantan et al., 2014; Li and Jurafsky, 2015; Qiu et al., 2016). However, the previously proposed models introduce side-effects: 1) mixing word-level and sense-level tokens achieves efficient sense selection but introduces ambiguous word-level tokens during the representation learning process (Neelakantan et al., 2014; Li and Jurafsky, 2015), and 2) pure sense-level tokens prevent ambiguity from word-level tokens but require exponential time complexity when decoding a sense sequence (Qiu et al., 2016).

Unlike the prior work, this paper proposes MUSE, a novel modularization framework incorporating sense selection and representation learning models, which implements flexible modules to optimize distinct mechanisms (footnote 2: the trained models and code are available at …). Specifically, MUSE enables linear-time sense identity decoding with a sense selection module and purely sense-level representation learning with a sense representation module.

With the modular design, we propose a novel joint learning algorithm on the modules by connecting to a reinforcement learning scenario, which achieves the following advantages. First, the decision making process under reinforcement learning better captures the sense selection mechanism than probabilistic and clustering methods. Second, our reinforcement learning algorithm realizes the first single objective function for modular unsupervised sense representation systems. Finally, we introduce various exploration techniques under reinforcement learning on sense selection to enhance robustness.

In summary, our contributions are five-fold:

  • MUSE is the first system that maintains purely sense-level representation learning with linear-time sense decoding.

  • We are among the first to leverage reinforcement learning to model the sense selection process in a sense representation system.

  • We are among the first to propose a single objective for modularized unsupervised sense embedding learning.

  • We introduce a sense exploration mechanism for the sense selection module to achieve better flexibility and robustness.

  • Our experimental results show the state-of-the-art performance for synonym selection and contextual word similarities in terms of MaxSimC.

2 Related Work

There are three dominant types of approaches for learning multi-sense word representations in the literature: 1) clustering methods, 2) probabilistic modeling methods, and 3) lexical ontology based methods. Our reinforcement learning based approach can be loosely connected to clustering methods and probabilistic modeling methods.

Reisinger and Mooney (2010) first proposed multi-sense word representations on the vector space based on clustering techniques. With the power of deep learning, some work exploited neural networks to learn embeddings with sense selection based on clustering (Huang et al., 2012; Neelakantan et al., 2014). Chen et al. (2014) replaced the clustering procedure with a word sense disambiguation model using WordNet (Miller, 1995). Kågebäck et al. (2015) and Vu and Parker (2016) further leveraged a weighting mechanism and interactive process in the clustering procedure. Moreover, Guo et al. (2014) leveraged bilingual resources for clustering. However, most of the above approaches separated the clustering procedure and the representation learning procedure without a joint objective, which may suffer from the error propagation issue. Instead, the proposed approach, MUSE, enables joint training on sense selection and representation learning.

Instead of clustering, probabilistic modeling methods have been applied for learning multi-sense embeddings in order to make the sense selection more flexible, where Tian et al. (2014) and Jauhar et al. (2015) conducted probabilistic modeling with EM training. Li and Jurafsky (2015) exploited the Chinese Restaurant Process to infer the sense identity. Furthermore, Bartunov et al. (2016) developed a non-parametric Bayesian extension of the skip-gram model (Mikolov et al., 2013b). Despite reasonable modeling of sense selection, all of the above methods mixed word-level and sense-level tokens during representation learning and were unable to conduct representation learning at the pure sense level due to the complicated computation in their EM algorithms.

Recently, Qiu et al. (2016) proposed an EM algorithm to learn purely sense-level representations, where the computational cost is high when decoding the sense identity sequence, because it takes exponential time to search all sense combinations within a context window. Our modular design addresses this drawback: the sense selection module decodes a sense sequence with linear-time complexity, while the sense representation module keeps representation learning at the pure sense level.

Figure 1: The MUSE architecture with a 3-step learning algorithm: 1) collocation sampling, 2) sense selection for sense representation learning, and 3) optimizing sense selection with a reward signal from sense representation. Reward signal is only passed to the target word to stabilize model training due to directional architecture in the sense representation module.

Unlike much related work that requires additional resources such as a lexical ontology (Pilehvar and Collier, 2016; Rothe and Schütze, 2015; Jauhar et al., 2015; Chen et al., 2015; Iacobacci et al., 2015) or bilingual data (Guo et al., 2014; Ettinger et al., 2016; Šuster et al., 2016), which may be unavailable in some languages, our model can be trained using only an unlabeled corpus. Also, some prior work proposed to learn topical embeddings and word embeddings jointly in order to consider the contexts (Liu et al., 2015a, b), whereas this paper focuses on learning multi-sense word embeddings.

3 Proposed Approach: MUSE

This work proposes a framework to modularize two key mechanisms for multi-sense word representations: a sense selection module and a sense representation module. The sense selection module decides which sense to use given a text context, whereas the sense representation module learns meaningful representations based on its statistical characteristics. Unlike prior work that must suffer from either inefficient sense selection (Qiu et al., 2016) or coarse-grained representation learning (Neelakantan et al., 2014; Li and Jurafsky, 2015; Bartunov et al., 2016), the proposed modularized framework is capable of performing efficient sense selection and learning representations in pure sense level simultaneously.

To learn sense-level representations, a sense selection model should first be established for sense identity decoding. On the other hand, the sense embeddings should guide the sense selection model when decoding a sense identity sequence. Therefore, these two modules should be tightly coupled. This indicates that the naive two-stage algorithms or separate learning algorithms proposed by prior work are not optimal.

By connecting the proposed formulation with reinforcement learning literature, we design a novel joint training algorithm. Besides, taking advantage of the form of reinforcement learning, we are among the first to investigate various exploration techniques in sense selection for unsupervised sense embedding learning.

3.1 Model Architecture

Our model architecture is illustrated in Figure 1, where there are two modules in optimization.

3.1.1 Sense Selection Module

Formally speaking, given a corpus $C$, vocabulary $W$, and the $t$-th word $C_t = w_i \in W$, we would like to find the most probable sense $z_{ik} \in Z_i$, where $Z_i$ is the set of senses in word $w_i$. Assuming that a word sense is determined by the local context, we exploit a local context $\bar{C}_t = \{C_{t-m}, \cdots, C_{t+m}\}$ for sense selection according to the Markov assumption, where $m$ is the size of a context window. Then we can either formulate a probabilistic policy $\pi(z_{ik} \mid \bar{C}_t)$ about sense selection or estimate the individual likelihood $q(z_{ik} \mid \bar{C}_t)$ for each sense identity.

To ensure efficiency, here we exploit a linear neural architecture that takes word-level input tokens and outputs sense-level identities. The architecture is similar to continuous bag-of-words (CBOW) (Mikolov et al., 2013a). Specifically, given a word embedding matrix $P$, the local context can be modeled as the summation of word embeddings from its context $\bar{C}_t$. The output can be formulated with a 3-mode tensor $Q$, whose dimensions denote words, senses, and latent variables. Then we can model $\pi(z_{ik} \mid \bar{C}_t)$ or $q(z_{ik} \mid \bar{C}_t)$ correspondingly. Here we model $\pi(\cdot)$ as a categorical distribution using a softmax layer:

$$\pi(z_{ik} \mid \bar{C}_t) = \frac{\exp\left(Q_{ik}^{\top} \sum_{j \in \bar{C}_t} P_j\right)}{\sum_{k' \in Z_i} \exp\left(Q_{ik'}^{\top} \sum_{j \in \bar{C}_t} P_j\right)}. \tag{1}$$

On the other hand, the likelihood of selecting distinct sense identities, $q(z_{ik} \mid \bar{C}_t)$, is modeled as a Bernoulli distribution with a sigmoid function $\sigma(\cdot)$:

$$q(z_{ik} \mid \bar{C}_t) = \sigma\left(Q_{ik}^{\top} \sum_{j \in \bar{C}_t} P_j\right). \tag{2}$$

Different modeling approaches require different learning methods, especially for the unsupervised setting. We defer the corresponding learning algorithms to § 3.2. Finally, with a built sense selection module, we can apply any selection algorithm such as a greedy selection strategy to infer the sense identity $z_{ik}$ given a word $w_i$ with its context $\bar{C}_t$.
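The two output heads of the sense selection module, (1) and (2), can be sketched in a few lines of plain Python. This is a minimal illustration and not the authors' implementation: the function names, the list-of-lists layout for the word embeddings and the per-word sense vectors, and the toy values are all assumptions made for the example.

```python
import math

def context_vector(P, context_ids):
    """Sum the word embeddings of the context words (CBOW-style)."""
    dim = len(P[0])
    return [sum(P[j][d] for j in context_ids) for d in range(dim)]

def sense_scores(Q_i, ctx):
    """Dot product between each sense vector of the target word and the context."""
    return [sum(q * c for q, c in zip(q_k, ctx)) for q_k in Q_i]

def policy(Q_i, ctx):
    """Categorical distribution over senses via softmax, in the spirit of (1)."""
    scores = sense_scores(Q_i, ctx)
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def q_values(Q_i, ctx):
    """Independent Bernoulli likelihood per sense via sigmoid, in the spirit of (2)."""
    return [1.0 / (1.0 + math.exp(-s)) for s in sense_scores(Q_i, ctx)]

# Toy setup (hypothetical values): 4 words, 2-dimensional embeddings,
# and 3 candidate senses for the target word.
P = [[0.1, 0.0], [0.0, 0.2], [0.3, 0.1], [-0.1, 0.4]]
Q_i = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]

ctx = context_vector(P, [0, 1, 3])
pi = policy(Q_i, ctx)
q = q_values(Q_i, ctx)
greedy_sense = max(range(len(q)), key=q.__getitem__)
```

Both heads share the same scores, so greedy selection picks the same sense under either one; they differ in how the scores are normalized, which matters for the learning algorithms.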

We note that the modularized model enables efficient sense selection by leveraging word-level tokens, while keeping purely sense-level tokens in the representation module. Specifically, if $n$ denotes $\max_k |Z_k|$, decoding $L$ words takes $O(nL)$ senses to be searched due to independent sense selection. The prior work using a single model with purely sense-level tokens (Qiu et al., 2016) requires exponential time to calculate the collocation energy for every possible combination of sense identities within a context window, $O(n^{2m})$, for a single target sense. Further, Qiu et al. (2016) took an additional sequence decoding step with quadratic time complexity $O(n^{4m} L)$, based on an exponential number $n^{2m}$ in the base unit. This demonstrates the sense inference efficiency of our proposed model.
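To make the gap between independent and joint decoding concrete, the following sketch simply counts candidates. It assumes every word has the same number of senses `n`, a sequence of `L` words, and a context window covering `m` words on each side of the target; the function names are hypothetical.

```python
def independent_decoding_candidates(n, L):
    """Each of the L words scores its n senses once, independently."""
    return n * L

def joint_window_combinations(n, m):
    """Every combination of senses for the 2m context words around one target."""
    return n ** (2 * m)

n, L, m = 3, 10, 5
linear = independent_decoding_candidates(n, L)
joint = joint_window_combinations(n, m)
```

With 3 senses per word and a window of 5, independent selection scores 30 candidates for 10 words, while enumerating sense combinations in a single window already yields 59049 cases.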

3.1.2 Sense Representation Module

Semantic representation learning is typically formulated as a maximum likelihood estimation (MLE) problem for collocation likelihood. In this paper, we use the skip-gram formulation (Mikolov et al., 2013b) because it requires less training time: only two sense identities are required for stochastic training. Other popular candidates, like GloVe (Pennington et al., 2014) and CBOW (Mikolov et al., 2013a), require more sense identities to be selected as input and are thus not suitable for our scenario. For example, GloVe (Pennington et al., 2014) takes computationally expensive collocation counting statistics for each token in a corpus as input, which requires sense selection for every occurrence of the target word across the whole corpus for a single optimization step.

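To learn the representations, we first create an input sense representation matrix $U$ and a collocation estimation matrix $V$ as the learning targets. Given a target word $w_i$ and collocated word $w_j$ with corresponding local contexts, we map them to their sense identities as $z_{ik}$ and $z_{jl}$ by the sense selection module, and maximize the sense collocation log likelihood $\log \mathcal{L}(z_{jl} \mid z_{ik})$. A natural choice of the likelihood function is formulated as a categorical distribution over all possible collocated senses given the target sense $z_{ik}$:

$$\mathcal{L}(z_{jl} \mid z_{ik}) = \frac{\exp\left(U_{z_{ik}}^{\top} V_{z_{jl}}\right)}{\sum_{z_{uv}} \exp\left(U_{z_{ik}}^{\top} V_{z_{uv}}\right)}. \tag{3}$$

Instead of enumerating all possible collocated senses, which is computationally expensive, we use the skip-gram objective (4) (Mikolov et al., 2013b) to approximate (3), as shown in the green block of Figure 1:

$$\log \bar{\mathcal{L}}(z_{jl} \mid z_{ik}) = \log \sigma\left(U_{z_{ik}}^{\top} V_{z_{jl}}\right) + \sum_{v=1}^{M} \mathbb{E}_{z_{uv} \sim p_{neg}(z)} \left[ \log \sigma\left(-U_{z_{ik}}^{\top} V_{z_{uv}}\right) \right], \tag{4}$$

where $p_{neg}(z)$ is the distribution over all senses for negative samples. In our experiment with $|Z_i|$ senses for word $w_i$, we use the word-level unigram as the sense-level unigram for efficiency, together with the $3/4$-th power trick in Mikolov et al. (2013b).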
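The negative-sampling objective and the 3/4-power unigram distribution can be sketched as follows. This is an illustrative approximation under stated assumptions, not the paper's code: each sampled negative stands in for one expectation term, the vectors and counts are toy values, and the function names are hypothetical.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def neg_sampling_log_likelihood(u_target, v_colloc, v_negatives):
    """log sigma(u.v) plus, for each sampled negative, log sigma(-u.v_neg)."""
    value = math.log(sigmoid(dot(u_target, v_colloc)))
    for v_neg in v_negatives:
        value += math.log(sigmoid(-dot(u_target, v_neg)))
    return value

def unigram_power_distribution(counts, power=0.75):
    """Negative-sampling distribution: unigram counts raised to the 3/4 power."""
    weights = [c ** power for c in counts]
    total = sum(weights)
    return [w / total for w in weights]

random.seed(0)
counts = [100, 10, 1]                      # toy word-level unigram counts
p_neg = unigram_power_distribution(counts)

u = [0.5, -0.2]                            # target sense vector (toy)
v_pos = [0.4, 0.1]                         # collocated sense vector (toy)
V = [[0.1, 0.3], [-0.2, 0.2], [0.0, -0.5]]
neg_ids = random.choices(range(len(V)), weights=p_neg, k=2)
ll = neg_sampling_log_likelihood(u, v_pos, [V[i] for i in neg_ids])
```

Raising counts to the 3/4 power flattens the distribution, so frequent items are sampled as negatives somewhat less often than their raw frequency would suggest. Every term in the objective is the log of a sigmoid, so the value is never positive.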

We note that our modular framework can easily maintain purely sense-level tokens with an arbitrary representation learning model. In contrast, most related work using probabilistic modeling (Tian et al., 2014; Jauhar et al., 2015; Li and Jurafsky, 2015; Bartunov et al., 2016) bound sense representations to the sense selection mechanism, so efficient sense selection by leveraging word-level tokens can be achieved only at the cost of mixing word-level and sense-level tokens in their representation learning process.

3.2 Learning

Without the supervised signal for the proposed modules, it is desirable to connect two modules in a way where they can improve each other by their own estimations. First, a trivial way is to forward the prediction of the sense selection module to the representation module. Then we cast the estimated collocation likelihood as a reward signal for the selected sense for effective learning.

To realize the above procedure, we cast the learning problem as a one-step Markov decision process (MDP) (Sutton and Barto, 1998), where the state, action, and reward correspond to context $\bar{C}_t$, sense $z_{ik}$, and collocation log likelihood $\log \bar{\mathcal{L}}(\cdot)$, respectively. Based on different modeling methods ((1) or (2)) in the sense selection module, we connect the model to respective reinforcement learning algorithms to solve the MDP. Specifically, we refer (1) to policy distribution and refer (2) to Q-value estimation in the reinforcement learning literature.

The proposed MDP framework embodies several nuances of sense selection. First, the decision of a word sense is Markov: taking the whole corpus into consideration is not more helpful than a handful of necessary local contexts. Second, the decision making in MDP exploits a hard decision for selecting sense identity, which captures the sense selection process more naturally than a joint probability distribution among senses (Qiu et al., 2016). Finally, we exploit the reward mechanism in MDP to enable joint training: the estimation of sense representation is treated as a reward signal to guide sense selection. In contrast, the decision making under clustering (Huang et al., 2012; Neelakantan et al., 2014) considers the similarity within clusters instead of the outcome of a decision using a reward signal as MDP.

3.2.1 Policy Gradient Method

Because (1) fits a valid probability distribution, an intuitive optimization target is the expectation of the resulting collocation likelihood over the senses. In addition, as the skip-gram formulation in (4) is unidirectional ($z_{ik} \rightarrow z_{jl}$), we perform one-side optimization for the target sense $z_{ik}$ to stabilize model training (footnote 3: we observe a performance drop when optimizing input selection and output selection simultaneously). That is, for the target word $w_i$ and the collocated word $w_j$ given respective contexts $\bar{C}_t$ and $\bar{C}_{t'}$ ($0 < |t' - t| \le m$), we first draw a sense $z_{jl}$ for $w_j$ from the policy $\pi(\cdot \mid \bar{C}_{t'})$ and optimize the expected collocation likelihood for the target sense $z_{ik}$ as follows,

$$\max_{P, Q} \; \mathbb{E}_{z_{ik} \sim \pi(\cdot \mid \bar{C}_t)} \left[ \log \bar{\mathcal{L}}(z_{jl} \mid z_{ik}) \right]. \tag{5}$$

Note that (4) can be merged into (5) as a single objective. The objective is differentiable and supports stochastic optimization (Lei et al., 2016), which uses a stochastic sample for optimization.

However, there are two possible disadvantages in this formulation. First, because the policy assumes the probability distribution in (1), optimizing the selected sense must affect the estimation of the other senses. Second, applying stochastic gradient ascent to (5) always lowers the probability estimation for the selected sense, even if the model accurately selects the right sense. The detailed proof is in Appendix A.
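Both disadvantages can be observed numerically with a single sampled update. The sketch below is an illustration under simplifying assumptions and is not the proof from Appendix A: the softmax logits are treated as free parameters instead of being produced by the selection module, the update is a REINFORCE-style step on reward times log-probability, and the reward is a log likelihood, so it is never positive.

```python
import math

def softmax(logits):
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_step(logits, action, reward, lr=0.1):
    """One stochastic gradient ascent step on reward * log pi(action)."""
    pi = softmax(logits)
    # d log pi(action) / d logit_k = 1[k == action] - pi[k]
    return [
        x + lr * reward * ((1.0 if k == action else 0.0) - pi[k])
        for k, x in enumerate(logits)
    ]

logits = [0.5, 0.0, -0.5]      # toy sense logits (hypothetical)
selected = 0                   # the sampled sense
reward = math.log(0.9)         # a high likelihood still gives a negative reward

before = softmax(logits)
after = softmax(reinforce_step(logits, selected, reward))
```

Even with a likelihood of 0.9, the probability of the selected sense goes down after the step, and the probability mass moves to senses that were never selected.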

3.2.2 Value-Based Method

To address the above issues, we apply the Q-learning algorithm (Mnih et al., 2013). Instead of maintaining a probabilistic policy for sense selection, Q-learning estimates the Q-value (resulting collocation log likelihood) for each sense candidate directly and independently. Thus, the estimation of unselected senses may not be influenced by the selected one. Note that in one-step MDP, the reward is equivalent to the Q-value, so we will use reward and Q-value interchangeably, hereinafter, based on the context.

We further follow the convention of recent neural reinforcement learning by reducing the reward range to aid model training (Mnih et al., 2013). Specifically, we replace the log likelihood $\log \bar{\mathcal{L}}(\cdot)$ with the likelihood $\bar{\mathcal{L}}(\cdot)$ as the reward function. Because $\log(\cdot)$ is monotonic, the relative ordering of the reward remains the same.

Furthermore, we exploit the probabilistic nature of likelihood for Q-learning. To elaborate, as Q-learning is used to approximate the Q-value for each action in typical reinforcement learning, most literature adopted square loss to characterize the discrepancy between the target and estimated Q-values (Mnih et al., 2013). In our setting where the Q-value/reward is a likelihood function, our model exploits cross-entropy loss to better capture the characteristics of probability distribution.

Given that the collocation likelihood in (4) is an approximation to the original categorical distribution with a softmax function shown in (3) (Mikolov et al., 2013b), we revise the formulation by omitting the negative sampling term. The resulting formulation $\hat{\mathcal{L}}(z_{jl} \mid z_{ik})$ is a Bernoulli distribution indicating whether $z_{jl}$ collocates or not given $z_{ik}$:

$$\hat{\mathcal{L}}(z_{jl} \mid z_{ik}) = \sigma\left(U_{z_{ik}}^{\top} V_{z_{jl}}\right). \tag{6}$$

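There are three advantages to using $\hat{\mathcal{L}}$ instead of the approximated $\bar{\mathcal{L}}$ and the original $\mathcal{L}$. First, regarding the variance of estimation, $\hat{\mathcal{L}}$ better captures $\mathcal{L}$ than $\bar{\mathcal{L}}$ because $\bar{\mathcal{L}}$ involves sampling:

$$\mathrm{Var}(\hat{\mathcal{L}}) = 0, \quad \mathrm{Var}(\bar{\mathcal{L}}) > 0. \tag{7}$$

Second, regarding the relative ordering of estimation, for any two collocated senses $z_{jl}$ and $z_{j'l'}$ with a target sense $z_{ik}$, the following equivalence holds:

$$\mathcal{L}(z_{jl} \mid z_{ik}) < \mathcal{L}(z_{j'l'} \mid z_{ik}) \iff \hat{\mathcal{L}}(z_{jl} \mid z_{ik}) < \hat{\mathcal{L}}(z_{j'l'} \mid z_{ik}). \tag{8}$$

Third, for collocation computation, $\mathcal{L}$ requires all sense identities and $\bar{\mathcal{L}}$ requires $(M + 1)$ sense identities for $M$ negative samples, whereas $\hat{\mathcal{L}}$ only requires $1$ sense identity. In sum, the proposed $\hat{\mathcal{L}}$ approximates $\mathcal{L}$ with no variance, no “bias” (in terms of relative ordering), and significantly less computation.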

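Finally, because both the target distribution $\hat{\mathcal{L}}$ and the estimated distribution $q$ in (2) are Bernoulli distributions, we follow the last section to conduct one-side optimization by fixing a collocated sense $z_{jl}$ and optimizing the selected sense $z_{ik}$ with cross entropy as

$$\min_{P, Q} \; H\left(\hat{\mathcal{L}}(z_{jl} \mid z_{ik}),\; q(z_{ik} \mid \bar{C}_t)\right). \tag{9}$$
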
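The cross-entropy update for the value-based method can be sketched as below. This is a minimal illustration, not the authors' implementation: the per-sense logits are treated as free parameters, whereas in the full model they are produced from shared context embeddings, so other senses can still be affected indirectly. The function names, learning rate, and toy values are assumptions.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def cross_entropy(target, estimate):
    """Cross entropy between two Bernoulli distributions."""
    return -(target * math.log(estimate)
             + (1.0 - target) * math.log(1.0 - estimate))

def q_update(sense_logits, selected, reward, lr=0.5):
    """Gradient descent step on H(reward, sigmoid(logit)) for the selected sense."""
    updated = list(sense_logits)
    # For a sigmoid output, dH / d logit = sigmoid(logit) - target.
    updated[selected] -= lr * (sigmoid(sense_logits[selected]) - reward)
    return updated

logits = [0.0, 0.3, -0.2]   # toy per-sense logits (hypothetical)
reward = 0.8                # collocation likelihood used as the Q-value target

new_logits = q_update(logits, selected=0, reward=reward)
loss_before = cross_entropy(reward, sigmoid(logits[0]))
loss_after = cross_entropy(reward, sigmoid(new_logits[0]))
```

In contrast to the softmax policy, the estimate of the selected sense moves toward the reward from either side, and the logits of the unselected senses are left unchanged by the step.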
3.2.3 Joint Training

To jointly train the sense selection and sense representation modules, we first select a pair of collocated senses, $z_{ik}$ and $z_{jl}$, based on the sense selection module with any selection strategy (e.g., greedy), and then optimize the sense representation module and the sense selection module using the above derivations. Algorithm 1 describes the proposed MUSE model training procedure.

The major distinction between our modular framework and the two-stage clustering-representation learning framework (Neelakantan et al., 2014; Vu and Parker, 2016) is that we establish a reward signal from the sense representation module to the sense selection module to enable immediate and joint optimization.

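for $w_i = C_t \in C$ do
       sample $w_j = C_{t'}$ ($0 < |t' - t| \le m$);
       $z_{ik}$ = select($\bar{C}_t$, $w_i$);
       $z_{jl}$ = select($\bar{C}_{t'}$, $w_j$);
       optimize $U$, $V$ by (4) for the sense representation module;
       optimize $P$, $Q$ by (5) or (9) for the sense selection module;
Algorithm 1 Learning Algorithm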
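The loop in Algorithm 1 can be sketched end to end on a toy corpus. This is a simplified illustration under several assumptions, not the released implementation: it uses greedy selection with the value-based update in the spirit of (9), draws negatives uniformly instead of from a unigram distribution, trains one pair at a time instead of in mini-batches, and every name and default value (`train`, `n_senses`, `window`, and so on) is hypothetical.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def train(corpus, vocab_size, n_senses=2, dim=4, window=2,
          negatives=2, lr=0.05, epochs=2, seed=0):
    """Toy joint training loop; every name and default here is hypothetical."""
    rng = random.Random(seed)
    bound = math.sqrt(3.0 / dim)

    def rand_vec():
        return [rng.uniform(-bound, bound) for _ in range(dim)]

    # Sense selection module: word embeddings P, per-word sense vectors Q.
    P = [rand_vec() for _ in range(vocab_size)]
    Q = [[[0.0] * dim for _ in range(n_senses)] for _ in range(vocab_size)]
    # Sense representation module: sense embeddings U, collocation vectors V.
    U = [[rand_vec() for _ in range(n_senses)] for _ in range(vocab_size)]
    V = [[[0.0] * dim for _ in range(n_senses)] for _ in range(vocab_size)]

    def neighbors(t):
        lo, hi = max(0, t - window), min(len(corpus), t + window + 1)
        return [s for s in range(lo, hi) if s != t]

    def select(t):
        """Greedy sense selection from the context of position t."""
        ids = [corpus[s] for s in neighbors(t)]
        ctx = [sum(P[j][d] for j in ids) for d in range(dim)]
        w = corpus[t]
        scores = [sigmoid(dot(Q[w][k], ctx)) for k in range(n_senses)]
        return w, max(range(n_senses), key=scores.__getitem__), ctx, ids

    for _epoch in range(epochs):
        for t in range(len(corpus)):
            t2 = rng.choice(neighbors(t))        # collocation sampling
            wi, k, ctx, ids = select(t)          # target sense
            wj, l, _ctx2, _ids2 = select(t2)     # collocated sense
            # Representation step: skip-gram with negative sampling.
            u = U[wi][k]
            pairs = [(V[wj][l], 1.0)]
            for _n in range(negatives):
                neg = V[rng.randrange(vocab_size)][rng.randrange(n_senses)]
                pairs.append((neg, 0.0))
            grad_u = [0.0] * dim
            for v, label in pairs:
                g = lr * (label - sigmoid(dot(u, v)))
                for d in range(dim):
                    grad_u[d] += g * v[d]
                    v[d] += g * u[d]
            for d in range(dim):
                u[d] += grad_u[d]
            # Selection step: cross entropy toward the likelihood reward.
            reward = sigmoid(dot(u, V[wj][l]))
            err = lr * (reward - sigmoid(dot(Q[wi][k], ctx)))
            q_old = list(Q[wi][k])
            for d in range(dim):
                Q[wi][k][d] += err * ctx[d]
            for j in ids:
                for d in range(dim):
                    P[j][d] += err * q_old[d]
    return P, Q, U, V

P, Q, U, V = train([0, 1, 2, 1, 0, 2, 3, 1], vocab_size=4)
```

The reward is computed from the representation module and treated as a constant when updating the selection module, which is how the two modules stay separate while still being trained in the same pass.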

3.3 Sense Selection Strategy

Given a fitness estimation for each sense, exploiting the greedy sense is the most popular strategy for clustering algorithms (Neelakantan et al., 2014; Kågebäck et al., 2015) and hard-EM algorithms (Qiu et al., 2016; Jauhar et al., 2015) in literature. However, there are two incentives to conduct exploration. First, in the early training stage when the fitness is not well estimated, it is desirable to explore underestimated senses. Second, due to high ambiguity in natural language, sometimes multiple senses in a word would fit in the same context. The dilemma between exploring sub-optimal choices and exploiting the optimal choice is called exploration-exploitation trade-off in reinforcement learning (Sutton and Barto, 1998).

We introduce exploration mechanisms for sense selection for both policy gradient and Q-learning. For policy gradient, we sample the policy distribution to approximate the expectation in (5). Because of the flexible formulation of Q-learning, the following classic exploration mechanisms are applied to sense selection:

  • Greedy: selects the sense with the largest Q-value (no exploration).

  • ε-Greedy: selects a random sense with probability ε, and adopts the greedy strategy otherwise (Mnih et al., 2013).

  • Boltzmann: samples the sense based on the Boltzmann distribution modeled by Q-value. We directly use (1) as the Boltzmann distribution for simplicity.

We note that Q-learning with Boltzmann sampling yields the same sampling process as policy gradient but different optimization objectives. To our best knowledge, we are among the first to explore several exploration strategies for unsupervised sense embedding learning.
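The three selection strategies can be sketched as small functions. This is an illustrative sketch with hypothetical names: `q_values` stands for the per-sense Q-value estimates, and `scores` for the unnormalized sense scores that a softmax turns into the Boltzmann distribution.

```python
import math
import random

def greedy(q_values):
    """Pick the sense with the largest estimated Q-value."""
    return max(range(len(q_values)), key=q_values.__getitem__)

def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon pick a uniformly random sense, else be greedy."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return greedy(q_values)

def boltzmann(scores, rng):
    """Sample a sense from the softmax distribution over the scores."""
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    return rng.choices(range(len(scores)), weights=weights, k=1)[0]

rng = random.Random(0)
q = [0.2, 0.7, 0.1]
choice_greedy = greedy(q)
choice_eps = epsilon_greedy(q, epsilon=0.05, rng=rng)
choice_boltz = boltzmann([0.0, 1.0, -1.0], rng)
```

Greedy never explores, ε-Greedy explores uniformly regardless of the estimates, and Boltzmann sampling explores in proportion to how plausible each sense currently looks.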

In the following sections, MUSE-Policy denotes the proposed MUSE model with policy learning and MUSE-Greedy denotes the model using corresponding sense selection strategy for Q-learning.

4 Experiments

We evaluate our proposed MUSE model in both quantitative and qualitative experiments.

4.1 Experimental Setup

Our model is trained on the April 2010 Wikipedia dump (Shaoul and Westbury, 2010), which contains approximately 1 billion tokens. For fair comparison, we adopt the same vocabulary set as Huang et al. (2012) and Neelakantan et al. (2014). For preprocessing, we lowercase all words, apply the Stanford tokenizer and the Stanford sentence tokenizer (Manning et al., 2014), and remove all sentences with fewer than 10 tokens. The number of senses per word in $Q$ is set to 3, following prior work (Neelakantan et al., 2014).

In the experiments, the context window size is set to 5 ($m = 5$). The subsampling technique introduced by word2vec (Mikolov et al., 2013b) is applied to accelerate the training process. The learning rate is set to 0.025. The embedding dimension is 300. We initialize $Q$ and $V$ as zeros, and $P$ and $U$ from the uniform distribution $[-\sqrt{3/d}, \sqrt{3/d}]$, where $d$ is the embedding dimension, such that each embedding has unit length in expectation (Lei et al., 2015). Our model uses 25 negative senses for negative sampling in (4). The ε-Greedy sense selection strategy uses a fixed exploration rate ε.
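The unit-length property of the initialization can be checked numerically. A uniform distribution on $[-a, a]$ has variance $a^2/3$, so choosing $a = \sqrt{3/d}$ gives an expected squared norm of $d \cdot a^2 / 3 = 1$ for a $d$-dimensional vector. The sketch below assumes this range as one choice consistent with the stated property; the function name and the number of rows are hypothetical.

```python
import math
import random

def init_embeddings(rows, dim, rng):
    """Draw each entry uniformly from [-sqrt(3/dim), sqrt(3/dim)]."""
    bound = math.sqrt(3.0 / dim)
    return [[rng.uniform(-bound, bound) for _ in range(dim)] for _ in range(rows)]

rng = random.Random(0)
E = init_embeddings(rows=1000, dim=300, rng=rng)

# Average squared Euclidean norm over all rows; close to 1 by construction.
mean_sq_norm = sum(sum(x * x for x in row) for row in E) / len(E)
```

For a dimension of 300, the bound evaluates to 0.1.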

In optimization, we conduct mini-batch training with a batch size of 2048 using the following procedure: 1) select senses in the batch; 2) optimize $U$, $V$ using stochastic training within the batch for efficiency; 3) optimize $P$, $Q$ using mini-batch training for robustness.

4.2 Experiment 1: Contextual Word Similarity

To evaluate the quality of the learned sense embeddings, we compute the similarity score between each word pair given their respective local contexts and compare with the human-judged score using Stanford’s Contextual Word Similarities (SCWS) dataset (Huang et al., 2012). Specifically, given a list of word pairs with corresponding contexts, $S = \{(w_i, \bar{C}_t, w_j, \bar{C}_{t'})\}$, we calculate the Spearman’s rank correlation between human-judged similarity and model similarity estimations (footnote 4: for example, the human-judged similarity between “… east bank of the Des Moines River …” and “… basis of all money laundering …” is 2.5 out of 10.0 in the SCWS dataset (Huang et al., 2012)). Two major contextual similarity estimations are introduced by Reisinger and Mooney (2010): AvgSimC and MaxSimC. AvgSimC is a soft measurement that addresses the contextual information with a probability estimation:

$$\mathrm{AvgSimC}(w_i, \bar{C}_t, w_j, \bar{C}_{t'}) = \sum_{k=1}^{|Z_i|} \sum_{l=1}^{|Z_j|} \pi(z_{ik} \mid \bar{C}_t)\, \pi(z_{jl} \mid \bar{C}_{t'})\, d(z_{ik}, z_{jl}),$$

where $d(z_{ik}, z_{jl})$ refers to the cosine similarity between $U_{z_{ik}}$ and $U_{z_{jl}}$. AvgSimC weights the similarity measurement of each sense pair $z_{ik}$ and $z_{jl}$ by their probability estimations. On the other hand, MaxSimC is a hard measurement that only considers the most probable senses:

$$\mathrm{MaxSimC}(w_i, \bar{C}_t, w_j, \bar{C}_{t'}) = d(z_{ik}, z_{jl}), \quad z_{ik} = \arg\max_{z_{ik'}} \pi(z_{ik'} \mid \bar{C}_t), \quad z_{jl} = \arg\max_{z_{jl'}} \pi(z_{jl'} \mid \bar{C}_{t'}).$$

Method                      MaxSimC   AvgSimC
Huang et al. (2012)         26.1      65.7
Neelakantan et al. (2014)   60.1      69.3
Tian et al. (2014)          63.6      65.4
Li and Jurafsky (2015)      66.6      66.8
Bartunov et al. (2016)      53.8      61.2
Qiu et al. (2016)           64.9      66.1
MUSE-Policy                 66.1      67.4
MUSE-Greedy                 66.3      68.3
MUSE-ε-Greedy               67.4†     68.6
MUSE-Boltzmann              67.9†     68.7
Table 1: Spearman’s rank correlation × 100 on the SCWS dataset. † denotes superior performance to all unsupervised competitors.
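The difference between the two measurements reported in Table 1 is easy to see on a toy example. The sketch below is illustrative only: it assumes each word comes with a list of sense vectors and a matching list of sense probabilities given its context, and the function names and values are hypothetical.

```python
import math

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    return num / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def avg_sim_c(probs_i, senses_i, probs_j, senses_j):
    """Probability-weighted cosine similarity over all sense pairs."""
    return sum(
        p_k * p_l * cosine(s_k, s_l)
        for p_k, s_k in zip(probs_i, senses_i)
        for p_l, s_l in zip(probs_j, senses_j)
    )

def max_sim_c(probs_i, senses_i, probs_j, senses_j):
    """Cosine similarity between the most probable sense of each word."""
    k = max(range(len(probs_i)), key=probs_i.__getitem__)
    l = max(range(len(probs_j)), key=probs_j.__getitem__)
    return cosine(senses_i[k], senses_j[l])

# Two words with two orthogonal senses each (toy values).
senses_i = [[1.0, 0.0], [0.0, 1.0]]
senses_j = [[1.0, 0.0], [0.0, 1.0]]
probs_i = [0.9, 0.1]   # context of word i favors its first sense
probs_j = [0.2, 0.8]   # context of word j favors its second sense

avg = avg_sim_c(probs_i, senses_i, probs_j, senses_j)
mx = max_sim_c(probs_i, senses_i, probs_j, senses_j)
```

Here the most probable senses are orthogonal, so MaxSimC is 0, while AvgSimC still credits the less likely matching sense pairs and gives 0.9 · 0.2 + 0.1 · 0.8 = 0.26. MaxSimC therefore depends on the selected sense being right, whereas AvgSimC can partly mask selection errors.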

The baselines for comparison include classic clustering methods (Huang et al., 2012; Neelakantan et al., 2014), EM algorithms (Tian et al., 2014; Qiu et al., 2016; Bartunov et al., 2016), and the Chinese Restaurant Process (Li and Jurafsky, 2015) (footnote 5: we run Li and Jurafsky (2015)’s released code on our corpus for fair comparison), where all approaches are trained on the same corpus except that Qiu et al. (2016) used more recent Wikipedia dumps. The embedding sizes of all baselines are 300, except 50 in Huang et al. (2012). For every competitor with multiple settings, we report the best performance in each similarity measurement setting in Table 1.

Method              ESL-50    RD-300    TOEFL-80
1) Conventional Word Embedding
Global Context      47.73     45.07     60.87
Skip-Gram           52.08     55.66     66.67
2) Word Sense Disambiguation
IMS+SG              41.67     53.77     66.67
3) Unsupervised Sense Embeddings
EM                  27.08     33.96     40.00
MSSG                57.14     58.93     78.26
CRP                 50.00     55.36     82.61
MUSE-Policy         52.38     51.79     79.71
MUSE-Greedy         57.14     58.93     79.71
MUSE-ε-Greedy       61.90†    62.50†    84.06†
MUSE-Boltzmann      64.29†    66.07†    88.41†
4) Supervised Sense Embeddings
Retro-GC            63.64     66.20     71.01
Retro-SG            56.25     65.09     73.33
Table 2: Accuracy on synonym selection. † denotes superior performance to all unsupervised competitors.

Our MUSE model achieves the state-of-the-art performance on MaxSimC, demonstrating superior quality on independent sense embeddings. On the other hand, MUSE achieves comparable performance with the best competitor in terms of AvgSimC (68.7 vs. 69.3), while MUSE outperforms the same competitor significantly in terms of MaxSimC (67.9 vs. 60.1). The results demonstrate not only the high quality of sense representations but also accurate sense selection.

From the application perspective, MaxSimC refers to a typical scenario using a single embedding per word, while AvgSimC employs multiple sense vectors simultaneously per word, which not only brings computational overhead but also requires changes to existing neural architectures for NLP. Hence, we argue that MaxSimC better characterizes the practical usage of a sense representation system than AvgSimC.

Among various learning methods for MUSE, policy gradient performs worst, echoing our argument in § 3.2.1. On the other hand, the superior performance of Boltzmann sampling and ε-Greedy over Greedy selection demonstrates the effectiveness of exploration.

Finally, replacing $\bar{\mathcal{L}}$ with $\hat{\mathcal{L}}$ as the reward signal speeds up both MUSE-ε-Greedy and MUSE-Boltzmann in reaching 67.0 in MaxSimC, which demonstrates the efficacy of the proposed approximation $\hat{\mathcal{L}}$ over the typical $\bar{\mathcal{L}}$ in terms of convergence.

4.3 Experiment 2: Synonym Selection

We further evaluate our model on synonym selection using multi-sense word representations (Jauhar et al., 2015). Three standard synonym selection datasets, ESL-50 (Turney, 2001), RD-300 (Jarmasz and Szpakowicz, 2004), and TOEFL-80 (Landauer and Dumais, 1997), are used. In the datasets, each question consists of a question word $w_Q$ and four answer candidates $\{w_A, w_B, w_C, w_D\}$, and the goal is to select the most semantically synonymous choice among the four candidates. For example, in the TOEFL-80 dataset, a question shows (Q) enormously, (A) appropriately, (B) uniquely, (C) tremendously, (D) decidedly, and the answer is (C). A multi-sense representation system selects the synonym of the question word using the maximum sense-level cosine similarity as a proxy for semantic similarity (Jauhar et al., 2015).
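The selection rule for multi-sense representations can be sketched as follows. The sense vectors below are made-up toy values chosen so that candidate (C) wins, not learned embeddings, and the function names are hypothetical.

```python
import math

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    return num / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def select_synonym(question_senses, candidate_senses):
    """Return the candidate whose best sense pair has the highest cosine similarity.

    question_senses: list of sense vectors for the question word.
    candidate_senses: dict mapping a candidate label to its list of sense vectors.
    """
    def best_pair_similarity(senses):
        return max(cosine(q, c) for q in question_senses for c in senses)

    return max(candidate_senses, key=lambda label: best_pair_similarity(candidate_senses[label]))

# Toy sense vectors (hypothetical), two senses per word.
question = [[1.0, 0.1], [0.0, 1.0]]
candidates = {
    "A": [[-1.0, 0.0], [-0.5, -0.5]],
    "B": [[-0.2, -1.0], [-1.0, 0.3]],
    "C": [[0.9, 0.1], [-1.0, -1.0]],
    "D": [[0.5, -1.0], [-0.3, -0.3]],
}
answer = select_synonym(question, candidates)
```

Taking the maximum over sense pairs means a candidate only needs one of its senses to match one sense of the question word, so unrelated senses of either word do not dilute the score.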

Our model is compared with the following baselines: 1) conventional word embeddings: global context vectors (Huang et al., 2012) and skip-gram (Mikolov et al., 2013b); 2) applying supervised word sense disambiguation using the IMS system and then applying skip-gram on disambiguated corpus (IMS+SG) (Zhong and Ng, 2010); 3) unsupervised sense embeddings: EM algorithm (Jauhar et al., 2015), multi-sense skip-gram (MSSG) (Neelakantan et al., 2014), Chinese restaurant process (CRP) (Li and Jurafsky, 2015), and the MUSE models; 4) supervised sense embeddings with WordNet (Miller, 1995): retrofitting global context vectors (Retro-GC) and retrofitting skip-gram (Retro-SG) (Jauhar et al., 2015).

Among unsupervised sense embedding approaches, CRP and MSSG refer to the baselines with the highest MaxSimC and AvgSimC in Table 1, respectively. Here we report the setting for baselines based on the best average performance in this task. We also show the performance of supervised sense embeddings as an upper bound for unsupervised methods, since they use additional supervised information from WordNet.

Context | k-NN Senses
braves finish the season in tie with the los angeles dodgers | scoreless otl shootout 6-6 hingis 3-3 7-7 0-0
his later years proudly wore tie with the chinese characters for | pants trousers shirt juventus blazer socks anfield
of the mulberry or the blackberry and minos sent him to | cranberries maple vaccinium apricot apple
of the large number of blackberry users in the us federal | smartphones sap microsoft ipv6 smartphone
shells and/or high explosive squash head hesh and/or anti-tank | venter thorax neck spear millimeters fusiform
head was shaven to prevent head lice serious threat back then | shaved thatcher loki thorax mao luthor chest
appoint john pope republican as head of the new army of | multi-party appoints unicameral beria appointed
Table 3: Different word senses are selected by MUSE according to different contexts. The respective k-NN (sorted by collocation likelihood) senses are shown to indicate respective semantic meanings.

The results are shown in Table 2, where our MUSE-ε-Greedy and MUSE-Boltzmann significantly outperform all unsupervised sense embedding methods, echoing the superior quality of our sense vectors in the last section. MUSE-Boltzmann also outperforms the supervised sense embeddings in all but one setting, without any supervised signal during training. Finally, the MUSE methods with proper exploration outperform all unsupervised baselines consistently, demonstrating the importance of exploration.

4.4 Qualitative Analysis

We further conduct a qualitative analysis to check the semantic meanings of the different senses learned by MUSE, using the k-nearest neighbors (k-NN) of each sense representation. In addition, to validate the sense selection module, we provide contexts from the training corpus in which each sense is selected. Table 3 shows the results. The learned sense embeddings of the words “tie”, “blackberry”, and “head” clearly correspond to the correct senses under different contexts.
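The neighbor lookup described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes each sense has an input vector (rows of `U`) and an output vector (rows of `V`), and that ranking by the inner product is equivalent to ranking by collocation likelihood. The names `U`, `V`, `knn_senses`, and the toy sense labels are hypothetical.

```python
import numpy as np

def knn_senses(query, U, V, names, k=5):
    """Return the k sense names whose output vectors score highest against U[query]."""
    scores = V @ U[query]          # inner product with every candidate sense
    scores[query] = -np.inf        # exclude the query sense itself (an assumption)
    top = np.argsort(-scores)[:k]  # indices sorted by descending score
    return [names[i] for i in top]

# Toy embeddings (assumed): two senses of "tie" plus two neighbor senses.
names = ["tie_1", "tie_2", "trousers", "scoreless"]
U = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]])
V = U.copy()

clothing_neighbors = knn_senses(0, U, V, names, k=1)
sports_neighbors = knn_senses(1, U, V, names, k=1)
print(clothing_neighbors, sports_neighbors)
```

With real embeddings, the query index would be the sense chosen by the selection module for a given context, so the neighbor list changes with the context as in Table 3.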

Since we address an unsupervised setting that learns sense embeddings from an unannotated corpus, the discovered senses depend heavily on the training corpus. From our manual inspection, it is common for our model to discover only two senses for a word, as with “tie” and “blackberry”. However, this work focuses on developing unsupervised methods for learning sense embeddings, and the number of discovered senses is not a focus.

5 Conclusion

This paper proposes a novel modularized framework for unsupervised sense representation learning, which supports not only the flexible design of modular tasks but also joint optimization among modules. The proposed model is the first to implement purely sense-level representation learning with linear-time sense selection, and it achieves state-of-the-art performance on benchmark contextual word similarity and synonym selection tasks. In the future, we plan to investigate reinforcement learning methods to incorporate multi-sense word representations for downstream NLP tasks.


We would like to thank the reviewers for their insightful comments on the paper. The authors are supported by the Ministry of Science and Technology of Taiwan under the contract number 105-2218-E-002-033, Institute for Information Industry, and MediaTek Inc.


  • Bartunov et al. (2016) Sergey Bartunov, Dmitry Kondrashkin, Anton Osokin, and Dmitry Vetrov. 2016. Breaking sticks and ambiguities with adaptive skip-gram. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, pages 130–138.
  • Chen et al. (2015) Tao Chen, Ruifeng Xu, Yulan He, and Xuan Wang. 2015. Improving distributed representation of word sense via wordnet gloss composition and context clustering. Association for Computational Linguistics.
  • Chen et al. (2014) Xinxiong Chen, Zhiyuan Liu, and Maosong Sun. 2014. A unified model for word sense representation and disambiguation. In EMNLP, pages 1025–1035. Citeseer.
  • Ettinger et al. (2016) Allyson Ettinger, Philip Resnik, and Marine Carpuat. 2016. Retrofitting sense-specific word vectors using parallel text. In Proceedings of NAACL-HLT, pages 1378–1383.
  • Guo et al. (2014) Jiang Guo, Wanxiang Che, Haifeng Wang, and Ting Liu. 2014. Learning sense-specific word embeddings by exploiting bilingual resources. In COLING, pages 497–507.
  • Huang et al. (2012) Eric H. Huang, Richard Socher, Christopher D. Manning, and Andrew Y. Ng. 2012. Improving Word Representations via Global Context and Multiple Word Prototypes. In Annual Meeting of the Association for Computational Linguistics (ACL).
  • Iacobacci et al. (2015) Ignacio Iacobacci, Mohammad Taher Pilehvar, and Roberto Navigli. 2015. Sensembed: Learning sense embeddings for word and relational similarity. In ACL (1), pages 95–105.
  • Jarmasz and Szpakowicz (2004) Mario Jarmasz and Stan Szpakowicz. 2004. Roget’s thesaurus and semantic similarity. Recent Advances in Natural Language Processing III: Selected Papers from RANLP, 2003:111.
  • Jauhar et al. (2015) Sujay Kumar Jauhar, Chris Dyer, and Eduard H Hovy. 2015. Ontologically grounded multi-sense representation learning for semantic vector space models. In HLT-NAACL, pages 683–693.
  • Kågebäck et al. (2015) Mikael Kågebäck, Fredrik Johansson, Richard Johansson, and Devdatt Dubhashi. 2015. Neural context embeddings for automatic discovery of word senses. In Proceedings of NAACL-HLT, pages 25–32.
  • Landauer and Dumais (1997) Thomas K Landauer and Susan T Dumais. 1997. A solution to plato’s problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological review, 104(2):211.
  • Lei et al. (2015) Tao Lei, Regina Barzilay, and Tommi Jaakkola. 2015. Molding cnns for text: non-linear, non-consecutive convolutions. EMNLP.
  • Lei et al. (2016) Tao Lei, Regina Barzilay, and Tommi Jaakkola. 2016. Rationalizing neural predictions. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing.
  • Li and Jurafsky (2015) Jiwei Li and Dan Jurafsky. 2015. Do multi-sense embeddings improve natural language understanding? Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1722–1732.
  • Liu et al. (2015a) Pengfei Liu, Xipeng Qiu, and Xuanjing Huang. 2015a. Learning context-sensitive word embeddings with neural tensor skip-gram model. In IJCAI, pages 1284–1290.
  • Liu et al. (2015b) Yang Liu, Zhiyuan Liu, Tat-Seng Chua, and Maosong Sun. 2015b. Topical word embeddings. In AAAI, pages 2418–2424.
  • Manning et al. (2014) Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Association for Computational Linguistics (ACL) System Demonstrations, pages 55–60.
  • Mikolov et al. (2013a) Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. Proceedings of Workshop at ICLR.
  • Mikolov et al. (2013b) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119.
  • Miller (1995) George A Miller. 1995. Wordnet: a lexical database for english. Communications of the ACM, 38(11):39–41.
  • Mnih et al. (2013) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. 2013. Playing atari with deep reinforcement learning. NIPS Deep Learning Workshop.
  • Neelakantan et al. (2014) Arvind Neelakantan, Jeevan Shankar, Alexandre Passos, and Andrew McCallum. 2014. Efficient non-parametric estimation of multiple embeddings per word in vector space. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing.
  • Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher D Manning. 2014. Glove: Global vectors for word representation. volume 14, pages 1532–1543.
  • Pilehvar and Collier (2016) Mohammad Taher Pilehvar and Nigel Collier. 2016. De-conflated semantic representations. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing.
  • Qiu et al. (2016) Lin Qiu, Kewei Tu, and Yong Yu. 2016. Context-dependent sense embedding. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing.
  • Reisinger and Mooney (2010) Joseph Reisinger and Raymond J Mooney. 2010. Multi-prototype vector-space models of word meaning. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 109–117. Association for Computational Linguistics.
  • Rothe and Schütze (2015) Sascha Rothe and Hinrich Schütze. 2015. Autoextend: Extending word embeddings to embeddings for synsets and lexemes. In Proceedings of the ACL.
  • Shaoul and Westbury (2010) Cyrus Shaoul and Chris Westbury. 2010. The westbury lab wikipedia corpus.
  • Šuster et al. (2016) Simon Šuster, Ivan Titov, and Gertjan van Noord. 2016. Bilingual learning of multi-sense embeddings with discrete autoencoders. NAACL-HLT 2016.
  • Sutton and Barto (1998) Richard S Sutton and Andrew G Barto. 1998. Reinforcement learning: An introduction, volume 1. MIT press Cambridge.
  • Tian et al. (2014) Fei Tian, Hanjun Dai, Jiang Bian, Bin Gao, Rui Zhang, Enhong Chen, and Tie-Yan Liu. 2014. A probabilistic model for learning multi-prototype word embeddings. In COLING, pages 151–160.
  • Turney (2001) Peter D Turney. 2001. Mining the web for synonyms: Pmi-ir versus lsa on toefl. In European Conference on Machine Learning, pages 491–502. Springer.
  • Vu and Parker (2016) Thuy Vu and D Stott Parker. 2016. K-embeddings: Learning conceptual embeddings for words using context. In Proceedings of NAACL-HLT, pages 1262–1267.
  • Zhong and Ng (2010) Zhi Zhong and Hwee Tou Ng. 2010. It makes sense: A wide-coverage word sense disambiguation system for free text. In Proceedings of the ACL 2010 System Demonstrations, pages 78–83. Association for Computational Linguistics.

Appendix A Doubly Stochastic Gradient

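To derive the doubly stochastic gradient for equation (5), we first write the objective in (5) as an expectation over senses sampled from the selection policy, and expand that expectation into a sum over the candidate senses, each term weighted by its selection probability.

Consider the parameter set of the selection policy. The gradient of the objective with respect to these parameters is the expectation, under the policy, of the collocation log likelihood multiplied by the gradient of the log selection probability.

Accordingly, if we conduct typical stochastic gradient ascent training on the objective with respect to the policy parameters, using sampled senses and a given learning rate, each update adds to the parameters the learning rate times the collocation log likelihood of the sampled sense times the gradient of its log selection probability.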
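The three steps above can be sketched in symbols. The notation here is assumed for illustration and is not recovered from the source: $C$ is the context, $Z$ the candidate senses of the target word, $q_\theta(z \mid C)$ the selection policy with parameters $\theta$, $r(z)$ the collocation log likelihood of (4) for the sampled sense $z$ (treated as independent of $\theta$), and $\eta$ the learning rate.

```latex
\begin{align}
J(\theta) &= \mathbb{E}_{z \sim q_\theta(\cdot \mid C)}\big[ r(z) \big]
           = \sum_{z \in Z} q_\theta(z \mid C)\, r(z), \\
\nabla_\theta J(\theta) &= \sum_{z \in Z} r(z)\, \nabla_\theta q_\theta(z \mid C)
           = \mathbb{E}_{z \sim q_\theta(\cdot \mid C)}\big[ r(z)\, \nabla_\theta \log q_\theta(z \mid C) \big], \\
\theta &\leftarrow \theta + \eta\, r(z)\, \nabla_\theta \log q_\theta(z \mid C),
           \qquad z \sim q_\theta(\cdot \mid C).
\end{align}
```

The second line uses the identity $\nabla_\theta q_\theta = q_\theta \nabla_\theta \log q_\theta$, which turns the sum back into an expectation that can be estimated from samples.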

However, the collocation log likelihood is always non-positive. Therefore, as long as it is negative, the update decreases the likelihood of choosing the sampled sense, even though that sense may be a good choice. On the other hand, if the log likelihood reaches zero, then according to (4) the underlying scores must be infinite, which leads to computational overflow.
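The sign problem described above can be checked numerically. The sketch below assumes a softmax policy over three hypothetical candidate senses and uses a reward equal to the log of a high likelihood (0.9); the helper names are illustrative, not from the paper.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array of logits."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def score_function_step(logits, action, reward, lr):
    """One stochastic gradient ascent step on reward * log pi(action)."""
    probs = softmax(logits)
    grad_log_prob = -probs           # d log pi(action) / d logits = onehot(action) - probs
    grad_log_prob[action] += 1.0
    return logits + lr * reward * grad_log_prob

logits = np.zeros(3)                 # uniform policy over three candidate senses (assumed)
reward = np.log(0.9)                 # log likelihood of a good choice, still negative
p_before = softmax(logits)[0]
p_after = softmax(score_function_step(logits, 0, reward, lr=0.5))[0]
print(p_before, p_after)
```

Even though the sampled sense has a high collocation likelihood, its selection probability drops after the update, because any log likelihood below zero acts as a negative reward.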