Moving Down the Long Tail of Word Sense Disambiguation with Gloss-Informed Biencoders

by   Terra Blevins, et al.
University of Washington

A major obstacle in Word Sense Disambiguation (WSD) is that word senses are not uniformly distributed, causing existing models to generally perform poorly on senses that are either rare or unseen during training. We propose a bi-encoder model that independently embeds (1) the target word with its surrounding context and (2) the dictionary definition, or gloss, of each sense. The encoders are jointly optimized in the same representation space, so that sense disambiguation can be performed by finding the nearest sense embedding for each target word embedding. Our system outperforms previous state-of-the-art models on English all-words WSD; these gains predominantly come from improved performance on rare senses, leading to a 31.1 reduction on less frequent senses over prior work. This demonstrates that rare senses can be more effectively disambiguated by modeling their definitions.


page 1

page 2

page 3

page 4


BERT Has Uncommon Sense: Similarity Ranking for Word Sense BERTology

An important question concerning contextualized word embedding (CWE) mod...

Connect-the-Dots: Bridging Semantics between Words and Definitions via Aligning Word Sense Inventories

Word Sense Disambiguation (WSD) aims to automatically identify the exact...

Evaluating a Multi-sense Definition Generation Model for Multiple Languages

Most prior work on definition modeling has not accounted for polysemy, o...

Probabilistic FastText for Multi-Sense Word Embeddings

We introduce Probabilistic FastText, a new model for word embeddings tha...

Sense Embedding Learning for Word Sense Induction

Conventional word sense induction (WSI) methods usually represent each i...

FEWS: Large-Scale, Low-Shot Word Sense Disambiguation with the Dictionary

Current models for Word Sense Disambiguation (WSD) struggle to disambigu...

You Shall Know the Most Frequent Sense by the Company it Keeps

Unsupervised identification of the most frequent sense of a polysemous w...

1 Introduction

One of the major challenges of Word Sense Disambiguation (WSD) is overcoming the data sparsity that stems from the Zipfian distribution of senses in natural language Kilgarriff (2004). For example, in SemCor (the largest manually annotated dataset for WSD) 90% of mentions of the word plant correspond to the top two senses of the word, and only half of the ten senses of plant occur in the dataset at all Miller et al. (1993). Due to this data imbalance, many WSD systems show a strong bias towards predicting the most frequent sense (MFS) of a word regardless of the surrounding context Postma et al. (2016).

A successful WSD system should be able to overcome this bias and correctly disambiguate cases where a word takes a less frequent sense (LFS), without sacrificing performance on MFS examples. Previous work has found that incorporating lexical information such as sense definitions, or glosses, into WSD systems improves performance Luo et al. (2018, 2018).111For example, in the sentence “She planted the tree,” the gloss, or meaning, for the sense of plant is “put or set [something] firmly into the ground.” Miller (1995) Glosses have also been found to improve LFS performance; however, absolute performance on rare senses is still low, with models showing a 62.3 F1 performance drop between the MFS examples and the LFS ones Kumar et al. (2019).

In this paper, we show that this gap can be significantly reduced by jointly fine-tuning multiple pretrained encoders on WSD. We present a bi-encoder model built on top of BERT Devlin et al. (2019) that is designed to improve performance on rare and zero-shot senses. Similar to prior work, our system represents the target words and senses in the same embedding space by using a context encoder to represent the target word and surrounding context, and a gloss encoder to represent the sense definitions. However, our two encoders are jointly learned from the WSD objective alone and trained in an end-to-end fashion.

This approach allows our model to outperform prior work on the English all-words WSD task introduced in Raganato et al. (2017b). Analysis of our model shows that these gains come almost entirely from better performance on the less frequent senses, with an 15.6 absolute improvement in F1 performance over the closest performing system; our model also improves on prior work in the zero-shot setting, where we evaluate performance on words and senses not seen during training.

Finally, we train our model in a few-shot setting in order to investigate how well the bi-encoder system learns on a limited set of training examples per sense. The bi-encoder architecture is able to generalize better from the limited number of examples than a strong pretrained baseline. This results demonstrates the data efficiency of our system and indicates why it captures LFS well, as less common senses naturally only have a few training examples in the data.

In summary, the overall contributions of this work are as follows:

  • [topsep=1pt,itemsep=-1pt,partopsep=4pt, parsep=4pt]

  • We present a jointly optimized bi-encoder model (BEM) for WSD that improves performance on all-words English WSD.

  • We show that our model’s improvements come from better performance on LFS and zero-shot examples, without sacrificing accuracy on the most common senses.

  • We examine why our model performs well on LFS with a number of experiments, including an evaluation of the BEM in a few-shot learning setting demonstrating that the bi-encoder generalizes well from limited data.

The source code and trained models for our WSD bi-encoders can be found at

Figure 1: Architecture of our bi-encoder model for WSD. The context sentence and sense gloss text are input into the context and gloss encoders, respectively; each encoder is initialized with BERT. We take the output of the context encoder as the representation for the target word ; the first output of the gloss encoder, which corresponds to the BERT-specific start token [CLS], is used as a representation for each candidate sense . is compared to with a dot product, and the sen se with the highest similarity to is assigned as the predicted label.

2 Background and Related Work

Word Sense Disambiguation (WSD) is the task of predicting the particular sense, or meaning, of a word when it occurs in a specific context Navigli (2009). Understanding what a word means in context is critical to many NLP tasks, and WSD has been shown to help downstream tasks such as machine translation (MT) Vickrey et al. (2005); Neale et al. (2016); Rios Gonzales et al. (2017) and information extraction (IE) Ciaramita and Altun (2006); Bovi et al. (2015).

The formulation of WSD that we address is all-words WSD, where the model disambiguates every ambiguous word in the data (e.g., Palmer et al. (2001); Moro and Navigli (2015)

). Many WSD systems approached this task with manually engineered features that were used to learn an independent classifier, or

word expert, for each ambiguous lemma Zhong and Ng (2010); Shen et al. (2013). Later work also integrated word embeddings into this independent classifier approach Rothe and Schütze (2015); Iacobacci et al. (2016).

Neural models for WSD built on this approach by training encoders for better feature extraction; they then either still learned independent classifiers on top of the encoded features

Kågebäck and Salomonsson (2016), or labeled each word using a shared output space Raganato et al. (2017a)

. Other neural approaches used semi-supervised learning to augment the learned representations with additional data

Melamud et al. (2016); Yuan et al. (2016).

2.1 Lexical Resources for WSD

Definitions of senses, or glosses, have been shown to be a valuable resource for improving WSD. Lesk (1986) used the overlap between the definitions of senses and the context of the target word to predict the target sense. This approach was later extended to incorporate WordNet graph structure Banerjee and Pedersen (2003) and to incorporate word embeddings Basile et al. (2014). More recently, Luo et al. (2018, 2018) added sense glosses as additional inputs into their neural WSD system, significantly improving overall performance.

Most similar to our work, Kumar et al. (2019)

represented senses as continuous representations learned from encoded glosses. However, they took a pipelined approach and supervised the gloss encoder with knowledge graph embeddings; they then froze the sense representations to use them as static supervision for training the WSD system. This approach requires an additional form of supervision (for which they used knowledge graph embeddings), making it more difficult to generalize to new data without that source of supervision. In comparison, our model is trained in an end-to-end manner and learns to embed gloss text without additional supervision.

Other work has shown that neural models capture useful semantic information about words from their definitions, and has used them to encode lexical representations Bahdanau et al. (2017); Bosc and Vincent (2018). While they focused on representing words, rather than specific senses, their modeling approaches could be extended to sense representations.

2.2 Pretrained NLP Models for WSD

Pretrained models have been shown to capture a surprising amount of word sense information from their pretraining objectives alone Peters et al. (2018); Stanovsky and Hopkins (2018); Coenen et al. (2019), allowing the frozen pretrained representations to compete with previous state-of-the-art WSD systems Hadiwinoto et al. (2019). Building on these findings, Vial et al. (2019) incorporates pretrained BERT representations as inputs into their WSD system, and Loureiro and Jorge (2019) uses BERT’s contextualized outputs to create sense embeddings for each sense in WordNet.

Another approach to using pretrained models for WSD is to formulate the task as a sentence-pair classification problem, in which (context sentence, gloss) pairs are concatenated and cross-encoded with the pretrained model. This reduces the WSD task to a binary classification problem where the model is trained to predict whether the gloss matches the sense of the target word in the context sentence Huang et al. (2019). Given that transformer compute scales polynomially in the input length, our approach of independently encoding the contexts and sense glosses is more computationally efficient, and we also show that it performs better on the all-words WSD task (Section 5.1).

3 Methodology

In this section, we present an approach for WSD that is designed to more accurately model less frequent senses by better leveraging the glosses that define them. The overall model architecture is shown in Figure 1. Our bi-encoder model (BEM) consists of two independent encoders: (1) a context encoder, which represents the target word (and its surrounding context) and (2) a gloss encoder

, that embeds the definition text for each word sense. These encoders are trained to embed each token near the representation of its correct word sense. Each encoder is a deep transformer network initialized with BERT, in order to leverage the word sense information it captures from pretraining

Coenen et al. (2019); Hadiwinoto et al. (2019). To describe our approach, we formally define the task of WSD (Section 3.1), and then present the BEM system in detail (Section 3.2).

3.1 Word Sense Disambiguation

Word Sense Disambiguation (WSD) is the task of assigning a sense to a target word, given its context. More formally, given a word and context , a WSD system is a function such that subject to , where is all possible candidate senses of .

We focus on the task of all-words WSD, in which every ambiguous word in a given context is disambiguated.222In practice, this means every content word – noun, verb, adjective, and adverb – in the context is disambiguated by the WSD system. In this setting, a WSD model is given as input and outputs a sequence of sense predictions , where the model predicts the , , and senses from the candidate sense sets for , , and , respectively. For our approach, we assume for each sense that we also have a gloss that defines .

3.2 Bi-encoder for WSD

Our bi-encoder architecture independently encodes target words (with their contexts) and sense glosses Bromley et al. (1994); Humeau et al. (2019)

. Each of these models are initialized with BERT-base: therefore, the inputs to each encoder are padded with BERT-specific start and end symbols: input

is modified to [CLS], [SEP].

The context encoder, which we define as , takes as input a context sentence containing a set of target words to be disambiguated, s.t. , where is the target word in the context sentence. The encoder then produces a sequence of representations , where

or the representation output by . For words that are tokenized into multiple subword pieces by the BERT tokenizer, we represent the word by the average representation of its subword pieces. For example, let the through tokens correspond to the subpieces of the word, we have

The gloss encoder, defined as , takes in a gloss that defines the sense as input. The gloss encoder represents as

where we take the first representation output by the gloss encoder (corresponding to the input [CLS] token) as a global representation for .

We then score each candidate sense for a target word by taking the dot product of against every for :

for . During evaluation, we predict the sense of the target word to be the sense whose representation has the highest dot product score with .

We use a cross-entropy loss on the scores for the candidate senses of the target word

to train our bi-encoder model; the loss function of our system given a (word, sense) pair


Dev Test Datasets Concatenation of all Datasets SE07 SE2 SE3 SE13 SE15 Nouns Verbs Adj. Adv. ALL Baseline Systems WordNet S1 55.2 66.8 66.2 63.0 67.8 67.6 50.3 74.3 80.9 65.2 MFS (in training data) 54.5 65.6 66.0 63.8 67.1 67.7 49.8 73.1 80.5 65.5 BERT-base 68.6 75.9 74.4 70.6 75.2 75.7 63.7 78.0 85.8 73.7 Prior Work HCAN - 72.8 70.3 68.5 72.8 72.7 58.2 77.4 84.1 71.1 EWISE 67.3 73.8 71.1 69.4 74.5 74.0 60.2 78.0 82.1 71.8 GLU 68.1 75.5 73.6 71.1 76.2 - - - - 74.1 LMMS 68.1 76.3 75.6 75.1 77.0 - - - - 75.4 SVC - - - - - - - - - 75.6 GlossBERT 72.5 77.7 75.2 76.1 80.4 79.8 67.1 79.6 87.4 77.0 BEM 74.5 79.4 77.4 79.7 81.7 81.4 68.5 83.0 87.9 79.0

Table 1: F1-score (%) on the English all-words WSD task. ALL is the concatenation of all datasets, including the development set SE07. We compare our bi-encoder model (BEM) against the WordNet S1 and most frequent sense (MFS) baselines, as well as a frozen BERT-base classifier and recent prior work on this task.

4 Experimental Setup

4.1 WSD Task and Datasets

We evaluate our BEM system with the WSD framework established in Raganato et al. (2017b). We train our model on SemCor, a large dataset manually annotated with senses from WordNet that contains 226,036 annotated examples covering 33,362 separate senses Miller et al. (1993). We use the SemEval-2007 (SE07) dataset as our development set Pradhan et al. (2007); we hold out Senseval-2 (SE2; Palmer et al. (2001)), Senseval-3 (SE3; Snyder and Palmer (2004)), SemEval-2013 (SE13; Navigli et al. (2013)), and SemEval-2015 (SE15; Moro and Navigli (2015)) as evaluation sets, following standard practice. All sense glosses used in our system are retrieved from WordNet 3.0 Miller (1995).

4.2 Baselines

We compare the BEM against a number of baseline systems. We first consider two knowledge-based baselines: WordNet S1, which labels each example with its first (most common) sense as specified in WordNet, and most frequent sense (MFS), which assigns each word the most frequent sense it occurs with in the training data.

We also compare against the pretrained model used to initialize our BEM system, BERT-base Devlin et al. (2019), by learning a linear classifier for WSD on top of frozen BERT representations output by the final layer. We find that fine-tuning BERT-base on WSD classification does not improve performance over the frozen model; this finding holds for each of the pretrained encoders we consider. Specific training details for the frozen BERT baseline are given in Section 4.3. Since this baseline uses a standard, discrete classification setup, it backs off to the WordNet S1 predictions for unseen words.

Finally, we compare performance to six recent state-of-the-art systems. The HCAN Luo et al. (2018) model incorporates sense glosses as additional inputs into a neural WSD classifier. The EWISE model pretrains a gloss encoder against graph embeddings before freezing the learned sense embeddings and training an LSTM encoder on the WSD task Kumar et al. (2019). Hadiwinoto et al. (2019) investigates different ways of using the (frozen) pretrained BERT model to perform WSD, with their GLU model performing best; Vial et al. (2019) used various sense vocabulary compression (SVC) approaches to improve WSD learning.333For this work, we report the best result from a comparable setting (i.e., from a single model on the same training data). The LMMS system performs k-NN on word representations produced BERT against a learned inventory of embeddings for WordNet senses Loureiro and Jorge (2019). GlossBERT fine-tunes BERT on WSD by jointly encoding the context sentences and glosses Huang et al. (2019); this approach relies on a single, cross-encoder model, rather than our more efficient bi-encoder approach to independently encode contexts and glosses.

4.3 Model Architecture and Optimization

Our pretrained baseline is learned using a single linear layer and softmax on the output of the final layer of the frozen BERT-base model. Similarly, each encoder in the bi-encoder model is initialized with BERT-base. We obtain representations from each encoder by taking the outputs from the final layer of each encoder, and we optimize the model with a cross-entropy loss on the dot product score of these representations.444We initialize the models with BERT-base due to better baseline performance on WSD than RoBERTa-base, see Section 6.1 for more details

Additional hyperparameter and optimization details are given in the supplementary materials.

5 Evaluation

We present a series of experiments to evaluate our bi-encoder WSD model. We first compare the BEM against several baselines and prior work on English all-words WSD (Section 5.1), and then evaluate performance on the most frequent (MFS), less frequent (LFS), and zero-shot examples (Section 5.2).

5.1 Overall Results

Table 1 shows overall F1 results on the English all-words WSD task Raganato et al. (2017b). Frozen BERT-base is a strong baseline, outperforming all of the prior work that does not incorporate pretraining into their systems (GAS, HCAN, and EWISE). The GLU and SVC systems, which use the representations learned by BERT without fine-tuning, both slightly outperform our pretrained baseline. GlossBERT achieves even better WSD performance by fine-tuning BERT with their cross-encoder approach.

However, we also find that our BEM achieves the best F1 score on the aggregated ALL evaluation set, outperforming all baselines and prior work by at least 2 F1 points. This improvement holds across all of the evaluation sets in the WSD evaluation framework as well as for each part-of-speech on which we perform WSD. Therefore, we see that although many of the prior approaches considered build on pretrained models, we empirically observe that our bi-encoder model is a particularly strong method for leveraging BERT.

MFS LFS Zero-shot
Words Senses
WordNet S1 100.0 0.0 84.9 53.9
BERT-base 94.9 37.0 84.9 53.6
EWISE 93.5 31.2 91.0 -
BEM 94.1 52.6 91.2 68.9
BEM-bal 89.5 57.0 91.9 71.8
Table 2: F1-score (%) on the MFS, LFS, and zero-shot subsets of the ALL evaluation set. Zero-shot examples are the words and senses (respectively) that do not occur in the training data. The balanced BEM system, BEM-bal, is considered in Section 6.2.
Model Ablation Dev F1
Full BEM 74.5 -
Frozen Context Encoder 70.1 -4.4
Frozen Gloss Encoder 68.1 -6.4
Tied Encoders 74.1 -0.4
Table 3: Ablations on the bi-encoder model (BEM). We consider the effect of freezing each of the two encoders and of tying the parameters of the encoders on development set performance.

5.2 Zero-shot and Rare Senses Results

To better understand these overall results, we break down performance across different sense frequencies. We split examples from the aggregated ALL evaluation set into mentions with the most frequent sense (MFS) of the target word and mentions that are labeled with the other, less frequent senses (LFS) of that word. We also consider zero-shot performance for both unseen words and unseen senses by evaluating performance on examples that are not observed during training. We compare our model against the frozen BERT-base baseline and EWISE Kumar et al. (2019), which also reported performance in these settings (Table 2).

BEM performs best on rare senses.

The vast majority of BEM’s gains comes from better performance on the LFS examples, leading to a 15.6 F1 improvement over the BERT baseline on the LFS subset. Despite this gain on less frequent senses, BEM remains (approximately) as accurate on the MFS examples as prior work and the BERT baseline. While we still see a large difference of 41.5 F1 points between the MFS and LFS examples with BEM, this is a strong improvement over both the BERT-base baseline and the EWISE system.

BEM shows competitive performance on unseen words.

Next we evaluated BEM on zero-shot words that did not occur in the training data. In this setting, WordNet S1 is a very strong baseline that achieves almost 85 F1 points from an untrained knowledge-based approach. Since the BERT-base model backs off to the WordNet S1 baseline for unseen words, it gets the same performance in this setting. The EWISE model from previous work, as well as our BEM, both outperform this baseline, with the BEM achieving a slightly higher F1 score for zero-shot words.

BEM generalizes well to embedding zero-shot senses.

The bi-encoder model allows us to predict senses that do not occur in the training set by embedding senses; this is a valuable modeling contribution since many senses do not occur in even the largest manually labeled WSD datasets. We therefore evaluate the BEM and baselines on zero-shot senses. The WordNet most common sense baseline remains strong, and the BERT baseline performs similarly to this WordNet S1 baseline. However, our bi-encoder model outperforms both baselines by at least 15 F1 points. This demonstrates that BEM is able to learn useful sense representations from the gloss text that are able to generalize well to unseen senses.

6 Analysis Experiments

In our model evaluation, we found that BEM outperforms prior work by improving disambiguation of less frequent senses while maintaining high performance on common ones. This section presents a series of analysis experiments in order to determine which aspects of the approach contribute to these improvements. In Section 6.1, we ablate different aspects of our model, and we consider the effect of balancing the training signal across senses with different frequencies in Section 6.2. Finally, we perform a qualitative analysis of the learned sense embedding space in Section 6.3.

Pretrained Model Dev F1
BERT-base 68.6
BERT-large 67.5
RoBERTa-base 68.1
RoBERTa-large 69.5
Table 4: Performance of various pretrained encoders on the WSD development set.
Figure 2: Representations for the word plant encoded by the frozen BERT-base encoder (left) and the context encoder of our BEM system (right); visualized with t-SNE. Sense glosses are from Wordnet Miller (1995).

6.1 Model Ablations

We ablate aspects of the bi-encoder model in order to see how they contribute to the overall performance; we consider freezing the context encoder, freezing the gloss encoder, and tying the two encoders so that they share the same parameters.

The results are shown in Table 3. A frozen gloss encoder hinders the system more than a frozen context encoder, implying that the gloss encoder needs to update the pretrained parameters more than the context encoder. We also see that while having independent encoders gives us the best performance, tying the parameters of the two encoder harms performance much less than freezing either of them. The tied encoder ablation leads to a 0.4 F1 point decrease on SemEval2007, and outperforms all prior models on this evaluation set despite having half the trainable parameters of the full BEM system.

Next, we consider how the choice of pretrained model affects WSD performance. Table 4 shows the performance of BERT-base and BERT-large Devlin et al. (2019) on the WSD SemEval2007 evaluation set, which is used as our development set; we also consider the WSD performance of RoBERTa-base and RoBERTa-large Liu et al. (2019). Similarly to the pretrained BERT-base baseline from previous section, we do not fine-tune the pretrained encoders, as we found that for all considered pretrained encoders that this did not improve performance over the frozen model.

Surprisingly, we see similar performance on the development set across all of the encoders we consider, despite the large pretrained models having twice as many parameters as the base models. Although RoBERTa-large does slightly outperform the BERT-base encoder, we initialize the BEM with BERT-base for better training efficiency.

6.2 Balancing the Senses

Despite the improvement on less common senses over baselines (Section 5.2), we still see a large performance gap between the MFS and LFS subsets. One possible explanation is data imbalance, since the MFS subset contains many more training examples. To control for this effect, we consider an additional training scheme for the bi-encoder model, in which we re-balance the training signal for each candidate sense of a target word. We do this by weighting the loss of each sense in the set of candidate senses of the target word by its inverse frequency in the training data. By doing this, we allow each sense to contribute equally to the training signal for .

This balanced BEM model achieves an F1 score of 77.6, underperforming the standard BEM on the aggregated ALL evaluation set. Table 2 shows the performance of the balanced BEM. By breaking down the balanced model performance, the balanced BEM outperforms the standard BEM on LFS examples, but suffers from worse performance on the more common MFS examples. We also find that this balancing during training slightly improves performance on both zero-shot words and senses.

These findings show that while weighting the data gives better signal for less common senses, it comes at the cost of the (sometimes helpful) data bias towards more frequent sense. This finding holds with similar results from Postma et al. (2016), although their experiments focused on altering the composition of the training data, rather than modifying the loss. One possible direction for future work is a more thorough investigation of methods for obtain a stronger training signal from less frequent senses, while still taking the MFS bias into account.

6.3 Visualizing Sense Embeddings

Finally, we explore the word representations learned by our bi-encoder model from fine-tuning on the WSD task. We perform a qualitative evaluation on the representations output by the BEM context encoder and compare these representations against those from the final layer of the frozen BERT-base encoder.

Figure 2 shows the outputs from each system on all instances of the word plant in the SemCor dataset. We see that BERT-base already learns some general groupings of the senses without any explicit word sense supervision; however, the sense clusters become much more concentrated in the bi-encoder model. We also see that the noun senses are better separated by the BEM than the verb senses (which all cluster near each other). This is most likely due to the limited training data for these verb senses compared to the much more common noun sense examples. We present additional visualizations of other ambiguous words in Appendix B.

7 Few-shot Learning of WSD

In this section, we investigate how efficient the BEM is in a few-shot learning setting, by limiting the number of training examples the model can observe per sense. We hypothesize that our model will be more efficient than a standard classifier for learning WSD, due to the additional information provided by the sense definitions.

In order to simulate a low-shot data setting, we create -shot training sets by filtering the SemCor data such that the filtered set contains up to examples of each sense in the full dataset; we then train the bi-encoder model using only this filtered training data. We train models on values of and compare their performance against the model trained on the full train set. We also retrain the frozen BERT-base classifier baseline for each considered. In order to keep training comparable across different amounts of training data, we train each few-shot BEM for the same number of training steps as the system trained on the full dataset (approximately 180,000 updates).

The results of this experiment are given in Figure 3. Unsurprisingly, both the frozen BERT classifier and the BEM achieve better F1 scores as we increase and train them on additional data. However, we see that the BEM is more efficient on smaller values of , with a much smaller drop off in performance at k=1 than the pretrained baseline. This efficiency also allows the BEM to achieve similar performance to the full baseline model with only 5 (or fewer) examples per sense.

The performance of these few-shot models gives us insight into the the kinds of data that could be used to improve WSD models. While it would be prohibitively difficult to annotate many examples for every sense considered by a WSD system, it is possible that augmenting existing WSD data to provide a few labeled examples of rare senses could be more effective than simply annotating more data without considering the sense distribution. These sorts of considerations are particularly important when extending the WSD task to new domains or languages, where a great deal of new data needs to be annotated; an important goal for these sorts of data augmentation is to make sure they allow for the efficient learning of all senses.

Figure 3: Performance of WSD models on the ALL evaluation set, trained in the few-shot setting across different values of and compared against the systems trained on the full training set ( = All).

8 Conclusion

In this work, we address the issue of WSD systems underperforming on uncommon senses of words. We present a bi-encoder model (BEM) that maps senses and ambiguous words into the same embedding space by jointly optimizing the context and glosses encoders. The BEM then disambiguates the sense of each word by assigning it the label of the nearest sense embedding. This approach leads to a 31.1% error reduction over prior work on the less frequent sense examples.

However, we still see a large gap in performance between MFS and LFS examples, with our model still performing over 40 points better on the MFS subset. Most recent WSD systems show a similar trend: even the representations of frozen BERT-base that are not fine-tuned on WSD can achieve over 94 F1 on examples labeled with the most frequent sense.

This leaves better disambiguation of less common senses as the main avenue for future work on WSD. Potential directions include finding ways to obtain more informative training signal from uncommon senses, such as with different approaches to loss reweighting, and exploring the effectiveness of other model architectures on LFS examples. Another direction for future work would improve few-shot approaches to WSD, which is both important for moving WSD into new domains and for modeling rare senses that naturally have less support in WSD data.


This material is based on work conducted at the University of Washington, which was supported by the National Science Foundation Graduate Research Fellowship Program under Grant No. DGE-1762114. We thank Gabi Stanovsky, Ledell Wu, and the UW NLP group for helpful conversations and comments on the work.


  • D. Bahdanau, T. Bosc, S. Jastrzebski, E. Grefenstette, P. Vincent, and Y. Bengio (2017) Learning to compute word embeddings on the fly. arXiv preprint arXiv:1706.00286. Cited by: §2.1.
  • S. Banerjee and T. Pedersen (2003) Extended gloss overlaps as a measure of semantic relatedness. In IJCAI, Cited by: §2.1.
  • P. Basile, A. Caputo, and G. Semeraro (2014) An enhanced lesk word sense disambiguation algorithm through a distributional semantic model. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pp. 1591–1600. Cited by: §2.1.
  • T. Bosc and P. Vincent (2018) Auto-encoding dictionary definitions into consistent word embeddings. In

    Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

    pp. 1522–1532. Cited by: §2.1.
  • C. D. Bovi, L. E. Anke, and R. Navigli (2015) Knowledge base unification via sense embeddings and disambiguation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 726–736. Cited by: §2.
  • J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, and R. Shah (1994)

    Signature verification using a” siamese” time delay neural network

    In Advances in neural information processing systems, pp. 737–744. Cited by: §3.2.
  • M. Ciaramita and Y. Altun (2006) Broad-coverage sense disambiguation and information extraction with a supersense sequence tagger. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, EMNLP ’06, Stroudsburg, PA, USA, pp. 594–602. External Links: ISBN 1-932432-73-6, Link Cited by: §2.
  • A. Coenen, E. Reif, A. Yuan, B. Kim, A. Pearce, F. Viégas, and M. Wattenberg (2019) Visualizing and measuring the geometry of bert. arXiv preprint arXiv:1906.02715. Cited by: §2.2, §3.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186. Cited by: §1, §4.2, §6.1.
  • C. Hadiwinoto, H. T. Ng, and W. C. Gan (2019) Improved word sense disambiguation using pre-trained contextualized word representations. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 5300–5309. Cited by: §2.2, §3, §4.2.
  • L. Huang, C. Sun, X. Qiu, and X. Huang (2019) GlossBERT: bert for word sense disambiguation with gloss knowledge. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 3500–3505. Cited by: §2.2, §4.2.
  • S. Humeau, K. Shuster, M. Lachaux, and J. Weston (2019) Poly-encoders: transformer architectures and pre-training strategies for fast and accurate multi-sentence scoring. arXiv preprint arXiv:1905.01969. Cited by: §3.2.
  • I. Iacobacci, M. T. Pilehvar, and R. Navigli (2016) Embeddings for word sense disambiguation: an evaluation study. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 897–907. Cited by: §2.
  • M. Kågebäck and H. Salomonsson (2016) Word sense disambiguation using a bidirectional lstm. arXiv preprint arXiv:1606.03568. Cited by: §2.
  • A. Kilgarriff (2004) How dominant is the commonest sense of a word?. In International conference on text, speech and dialogue, pp. 103–111. Cited by: §1.
  • D. P. Kingma and J. L. Ba (2015) Adam: A method for stochastic optimization. In Proceedings of 3rd International Conference of Learning Representations, External Links: Link Cited by: Appendix A.
  • S. Kumar, S. Jat, K. Saxena, and P. Talukdar (2019) Zero-shot word sense disambiguation using sense definition embeddings. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 5670–5681. External Links: Link, Document Cited by: §1, §2.1, §4.2, §5.2.
  • M. Lesk (1986) Automatic sense disambiguation using machine readable dictionaries: how to tell a pine cone from an ice cream cone. In Proceedings of the 5th annual international conference on Systems documentation, pp. 24–26. Cited by: §2.1.
  • Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) Roberta: a robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692. Cited by: §6.1.
  • D. Loureiro and A. Jorge (2019) Language modelling makes sense: propagating representations through WordNet for full-coverage word sense disambiguation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 5682–5691. External Links: Link, Document Cited by: §2.2, §4.2.
  • F. Luo, T. Liu, Z. He, Q. Xia, Z. Sui, and B. Chang (2018) Leveraging gloss knowledge in neural word sense disambiguation by hierarchical co-attention. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 1402–1411. External Links: Link, Document Cited by: §1, §2.1, §4.2.
  • F. Luo, T. Liu, Q. Xia, B. Chang, and Z. Sui (2018) Incorporating glosses into neural word sense disambiguation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, pp. 2473–2482. External Links: Link, Document Cited by: §1, §2.1.
  • O. Melamud, J. Goldberger, and I. Dagan (2016) Context2vec: learning generic context embedding with bidirectional lstm. In Proceedings of the 20th SIGNLL conference on computational natural language learning, pp. 51–61. Cited by: §2.
  • G. A. Miller, C. Leacock, R. Tengi, and R. T. Bunker (1993) A semantic concordance. In Proceedings of the workshop on Human Language Technology, pp. 303–308. Cited by: §1, §4.1.
  • G. A. Miller (1995) WordNet: a lexical database for english. Communications of the ACM 38 (11), pp. 39–41. Cited by: Appendix B, §4.1, Figure 2, footnote 1.
  • A. Moro and R. Navigli (2015) Semeval-2015 task 13: multilingual all-words sense disambiguation and entity linking. In Proceedings of the 9th international workshop on semantic evaluation (SemEval 2015), pp. 288–297. Cited by: §2, §4.1.
  • R. Navigli, D. Jurgens, and D. Vannella (2013) Semeval-2013 task 12: multilingual word sense disambiguation. In Second Joint Conference on Lexical and Computational Semantics (* SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), pp. 222–231. Cited by: §4.1.
  • R. Navigli (2009) Word sense disambiguation: a survey. ACM computing surveys (CSUR) 41 (2), pp. 10. Cited by: §2.
  • S. Neale, L. Gomes, E. Agirre, O. L. de Lacalle, and A. Branco (2016) Word sense-aware machine translation: including senses as contextual features for improved translation models. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), Portorož, Slovenia, pp. 2777–2783. External Links: Link Cited by: §2.
  • M. Palmer, C. Fellbaum, S. Cotton, L. Delfs, and H. T. Dang (2001) English tasks: all-words and verb lexical sample. In Proceedings of SENSEVAL-2 Second International Workshop on Evaluating Word Sense Disambiguation Systems, pp. 21–24. Cited by: §2, §4.1.
  • M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer (2018) Deep contextualized word representations. In Proceedings of the North American Association for Computational Linguistics (NAACL), Cited by: §2.2.
  • M. Postma, R. I. Bevia, and P. Vossen (2016) More is not always better: balancing sense distributions for all-words word sense disambiguation. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pp. 3496–3506. Cited by: §1, §6.2.
  • S. Pradhan, E. Loper, D. Dligach, and M. Palmer (2007) Semeval-2007 task-17: english lexical sample, srl and all words. In Proceedings of the fourth international workshop on semantic evaluations (SemEval-2007), pp. 87–92. Cited by: §4.1.
  • A. Raganato, C. D. Bovi, and R. Navigli (2017a) Neural sequence learning models for word sense disambiguation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 1156–1167. Cited by: §2.
  • A. Raganato, J. Camacho-Collados, and R. Navigli (2017b) Word sense disambiguation: a unified evaluation framework and empirical comparison. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pp. 99–110. Cited by: §1, §4.1, §5.1.
  • A. Rios Gonzales, L. Mascarell, and R. Sennrich (2017)

    Improving word sense disambiguation in neural machine translation with sense embeddings

    In Proceedings of the Second Conference on Machine Translation, Copenhagen, Denmark, pp. 11–19. External Links: Link, Document Cited by: §2.
  • S. Rothe and H. Schütze (2015) AutoExtend: extending word embeddings to embeddings for synsets and lexemes. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Beijing, China, pp. 1793–1803. External Links: Link, Document Cited by: §2.
  • H. Shen, R. Bunescu, and R. Mihalcea (2013) Coarse to fine grained sense disambiguation in wikipedia. In Second Joint Conference on Lexical and Computational Semantics (* SEM), Volume 1: Proceedings of the Main Conference and the Shared Task: Semantic Textual Similarity, pp. 22–31. Cited by: §2.
  • B. Snyder and M. Palmer (2004) The english all-words task. In Proceedings of SENSEVAL-3, the Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text, Cited by: §4.1.
  • G. Stanovsky and M. Hopkins (2018)

    Spot the odd man out: exploring the associative power of lexical resources

    In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 1533–1542. Cited by: §2.2.
  • L. Vial, B. Lecouteux, and D. Schwab (2019) Sense vocabulary compression through the semantic knowledge of wordnet for neural word sense disambiguation. In Proceedings of the 10th Global WordNet Conference (GWC), Cited by: §2.2, §4.2.
  • D. Vickrey, L. Biewald, M. Teyssier, and D. Koller (2005) Word-sense disambiguation for machine translation. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, HLT ’05, Stroudsburg, PA, USA, pp. 771–778. External Links: Link, Document Cited by: §2.
  • T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, and J. Brew (2019) HuggingFace’s transformers: state-of-the-art natural language processing. ArXiv abs/1910.03771. Cited by: Appendix A.
  • D. Yuan, J. Richardson, R. Doherty, C. Evans, and E. Altendorf (2016) Semi-supervised word sense disambiguation with neural models. arXiv preprint arXiv:1603.07012. Cited by: §2.
  • Z. Zhong and H. T. Ng (2010) It makes sense: a wide-coverage word sense disambiguation system for free text. In Proceedings of the ACL 2010 system demonstrations, pp. 78–83. Cited by: §2.

Appendix A Additional Training Details

Both our frozen BERT baseline and the BEM are implemented in PyTorch

555 and optimized with Adam Kingma and Ba (2015). The pretrained models used to initialize each model are obtained through Wolf et al. (2019); we initialize every model with the bert-base-uncased encoder.

BERT-base baseline.

The linear layer of the frozen BERT-base classifier is trained for 100 epochs, and tuned over the following parameter ranges: learning rates of

and batch sizes of .

Bi-encoder Model (BEM).

The BEM is trained for 20 epochs with a warmup phase of 10,000 steps. We use a context batch size of 4 and a gloss batch size of 256. The model is tuned on learning rates in . We use two GPUs to train the BEM, optimizing each encoder on a separate GPU to allow for larger batch sizes.

Appendix B Additional Sense Embedding Explorations

We present additional sense embedding space visualizations (Figures 4, 5, and 6). These visualizations are generated identically to the one discussed in Section 6.3. In each figure, the left visualization shows the representations output by a frozen BERT-base model, and the right one shows the output of our BEM’s context encoder. All figures are visualized with t-SNE. We choose words from SemCor that occur more than 50 times; for clarity, we limit the visualization to the six most common senses of each word. All senses and glosses are gathered from WordNet Miller (1995).

Figure 4: Visualization of learned representations for the word good. Overall the BEM (right) doesn’t improve on the frozen BERT representations (left), but we observe that the adj.2 sense is becoming better distinguished from adj.1 by the BEM, with the examples of adj.2 appearing only in one edge of the cluster for adj.1.
Figure 5: Visualization of learned representations for the word light. We see more distinct clusters forming in the representations from the BEM (right) than in the BERT-base outputs (left), though there is still overlap with the edges of the some groups.
Figure 6: Visualization of learned representations for the word run. We see that both the frozen BERT model (left) and BEM system (right) has difficulty distinguishing the verb.1 and verb.2 senses of run, which are closely related senses with a very fine-grained distinction (see glosses given in legend).