Using BERT for Word Sense Disambiguation

09/18/2019 ∙ by Jiaju Du, et al. ∙ Tsinghua University 0

Word Sense Disambiguation (WSD), which aims to identify the correct sense of a given polyseme, is a long-standing problem in NLP. In this paper, we propose to use BERT to extract better polyseme representations for WSD and explore several ways of combining BERT and the classifier. We also utilize sense definitions to train a unified classifier for all words, which enables the model to disambiguate unseen polysemes. Experiments show that our model achieves the state-of-the-art results on the standard English All-word WSD evaluation.



There are no comments yet.


page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Ambiguity is common in natural language. Word Sense Disambiguation (WSD) deals with lexical ambiguity, i.e. polysemes in sentences. An effective WSD tool can benefit various downstream tasks, such as Information Retrieval Zhong and Ng (2012) and Machine Translation Neale et al. (2016).

Recently, the pre-trained language models, such as ELMo Peters et al. (2018), GPT Radford et al. (2018), and BERT Devlin et al. (2019), have been proven to be effective to extract features from plain text. They pre-train language models on large corpora, then add the pre-trained word representation into task-specific models, or directly fine-tune the language model on downstream tasks. Peters et al. (2018) tried to incorporate the pre-trained ELMo embeddings as WSD features, but there are currently no studies which fine-tune language models on WSD task.

There have been lots of works on WSD, and they can mainly be divided into two groups: supervised methods and knowledge-based methods. Supervised methods need large sense annotated corpora. They train a classifier using features extracted from the context of the polyseme. These methods can also be divided into two subgroups according to the features they use. Feature-based supervised methods use many conventional features like surrounding words, PoS tags of surrounding words, local word collections

Zhong and Ng (2010), word embeddings, and PoS tag embeddings Iacobacci et al. (2016)

. Neural-based supervised methods use a neural network encoder like BiLSTM to extract features

Melamud et al. (2016); Yuan et al. (2016); Raganato et al. (2017a).

Knowledge-based methods rely on the structure and content of knowledge bases, for instance, sense definitions Lesk (1986); Basile et al. (2014) and semantic networks which provide the relationship and similarity between two senses Agirre and Soroa (2009); Moro et al. (2014). Supervised methods perform better than knowledge-based methods Raganato et al. (2017b), but knowledge-based methods are usually unsupervised and require no sense annotated data. Some recent studies have explored ways of using knowledge like sense definitions to enhance supervised methods Luo et al. (2018b, a).

In this paper, we fine-tune BERT on the WSD task for the first time and compare the performance of different polyseme features output by the BERT encoder. The fine-tuned model beats baselines by a large margin. Many polysemes are rare, and the sense distribution of a polyseme is usually unbalanced. So the sense annotated corpora usually lack annotations for some polysemes or particular senses. To address this issue, we consider using sense definitions because we can obtain definitions for unseen sense from lexical databases easily. We find that the incorporation of sense definition improves the performance significantly.

Figure 1: Architecture of the Bert model. The polyseme “loveliest” is segmented into “love” and “##liest”.

The contributions of this paper are: (1) We fine-tune the BERT on WSD task and achieve the state-of-the-art results. (2) We prove that external knowledge is still useful for WSD with pre-trained language models and improves the performance.

2 Methodology

Given a sentence and some polysemes in the sentence, the WSD task aims to identify the correct senses of the polysemes in the sentence. Our model consists of an encoder and a classifier. The encoder extracts the polysemes’ features from sentences, and the classifier uses the features to predict senses. Figure 1 gives an overview of our model. We represent the sentence as a word sequence , and assume that is the polyseme. We denote the sense set of as . The definition of in the lexical databases is a word sequence . The hidden size of BERT is .

2.1 Encoder

We use BERT Devlin et al. (2019) as the encoder. BERT uses the WordPiece Wu et al. (2016) embeddings as a part of inputs. The WordPiece embeddings have a fixed size of vocabulary which includes some words and some word pieces, and segments out-of-vocabulary words. For example, the word “loveliest” is segmented into two word pieces “love” and “##liest”. Then a multi-layer Transformer reads the embeddings of the word piece sequence and outputs a hidden state for every token in the sequence.

As WSD is a single sentence tagging task, the most simple way of using BERT is to insert [CLS] at the start of the input sentence, and use the hidden state output by BERT to predict the label of the -th token. However, we cannot directly apply this simple method on WSD because some polysemes (about 15% in the training dataset) may be segmented into word pieces. These polysemes correspond to at least two final hidden states. We assume that the polyseme corresponds to a hidden state list

. To obtain a fixed-sized feature vector for the classifier, we average these hidden states:


is the feature used by the classifier. Max pooling is another way of merging these hidden states:

We can also get features by concatenating and the first final hidden state . corresponds to [CLS] and encodes global information of the sentence.

2.2 Classifier

We use a 2-layer MLP to predict the correct sense. The MLP outputs the sense distribution of the polyseme in the given context:

where are fully-connected linear layers, , and . The layer is specific for every polyseme.

This MLP classifier has poor performance on unseen or infrequent words and senses because of lack of data. So we introduce the sense definitions to address the issue of data scarcity. We use another BERT to encode the definition of a sense into its sense vector. Then this sense vector is used as the parameter when predicting this sense. Formally, we replace with , where is the concatenation of all sense vectors of the polyseme. The sense vectors map senses into a unified space and encode the semantic similarities into the distance between vectors. So the classification for unseen or infrequent senses can benefit from similar senses which have more training instances. We name the model without and with definitions Bert and Bert respectively.

In the training process, parameters are updated by minimizing the cross-entropy loss between the true label and the sense distribution :

where is the number of instances in the dataset, and is a one-hot vector which represents the true label of .

3 Experiments

3.1 Datasets

We evaluate our model on the English all-words tasks. We use the evaluation framework proposed by Raganato et al. (2017b). It provides five all-words fine-grained WSD datasets for evaluation: Senseval-2 (Edmonds and Cotton, 2001, SE2), Senseval-3 task 1 (Snyder and Palmer, 2004, SE3), SemEval-07 task 17 (Pradhan et al., 2007, SE7), SemEval-13 task 12 (Navigli et al., 2013, SE13), SemEval-15 task 13 (Moro and Navigli, 2015, SE15). Following previous works, we use SE7 as the validation dataset, and use the SE2, SE3, SE13, and SE15 as test datasets. The framework provides two annotated corpora for training: Semcor Miller et al. (1994) and OMSTI Taghipour and Ng (2015). We choose SemCor as our training dataset. Table 1 illustrates the statistics of these datasets. All of the datasets are annotated with WordNet Miller (1995) 3.0.

Training Validation Test
#Sentences 37,176 135 1,038
#Tokens 802,443 3,201 22,302
#Annotations 226,036 455 6,798
Ambiguity 6.8 8.5 5.7
Table 1: Statistics of the WSD training, validation, and test dataset. The “ambiguity” presents the average number of senses for instances in the dataset.

3.2 Experiment Setup

We use the BERT as the encoder. The number of Transformer layers is , the hidden layer size is , and the number of attention heads is . We tried to use the BERT, but the F1-score is nearly the same as BERT. We use Dropout in every layer of the classifier and the dropout rate is . The optimizer is Adam Kingma and Ba (2014). We reduce the learning rate during the training process. In the

-th epoch, the learning rate is

. The parameters of the BERT encoder are fixed in the first epochs. We train the model for epochs and choose the model which has the best F1-score on the validation set.

We compare our model with the following baselines: the simple MFS baseline which always outputs the most common sense in the training dataset, the knowledge-based methods Lesk Basile et al. (2014) and Babelfy Moro et al. (2014), the feature-based supervised methods IMS Zhong and Ng (2010) and IMS Iacobacci et al. (2016), and the neural based methods Bi-LSTM Kågebäck and Salomonsson (2016); Raganato et al. (2017a), GAS Luo et al. (2018b), CAN and HCAN Luo et al. (2018a).

3.3 English All-word Task Results

System Test Datasets Concatenation of Test Datasets All
SE2 SE3 SE13 SE15 Noun Verb Adj Adv
MFS Baseline 65.6 66.0 63.8 67.1 67.7 49.8 73.1 80.5 65.5
Lesk 63.0 63.7 66.2 64.6 70.0 51.1 51.7 80.6 64.2
Babelfy 67.0 63.5 66.4 70.3 68.9 50.7 73.2 79.8 66.4
IMS 70.9 69.3 65.3 69.5 70.5 55.8 75.6 82.9 68.9
IMS 72.2 70.4 65.9 71.5 71.9 56.6 75.9 84.7 70.1
Bi-LSTM 71.1 68.4 64.8 68.3 69.5 55.9 76.2 82.4 68.4
Bi-LSTM 72.0 69.1 66.9 71.5 71.5 57.5 75.0 83.8 69.9
GAS(Concatenation) 72.1 70.2 67.0 71.8 72.1 57.2 76.0 84.4 70.3
GAS(Concatenation) 72.2 70.5 67.2 72.6 72.2 57.7 76.6 85.0 70.6
CAN 72.3 69.8 65.5 71.1 71.1 57.3 76.5 84.7 69.8
CAN 72.2 70.2 69.1 72.2 73.5 56.5 76.6 80.3 70.9
HCAN 72.8 70.3 68.5 72.8 72.7 58.2 77.4 84.1 71.1
Bert 74.0 73.1 71.3 74.3 75.0 61.2 77.2 86.1 73.1
Bert 76.4 74.9 76.3 78.3 78.3 65.2 80.5 83.8 76.3
Table 2: F1-score(%) for the English all-word WSD evaluation. We report the F1-score of these systems on the four datasets (SE2, SE3, SE13, and SE15), on every part-of-speech type of polyseme (Noun, Verb, Adj, and Adv), and on the overall test dataset.

Table 2 shows the F1-score of our models and baselines on the standard English All-words WSD benchmark. The Bert and Bert in the table use the average operation to merge hidden states. They don’t use the sentence vectors. Our best model improves the state-of-the-art results by 5.2%, which indicates that BERT encoder is quite powerful in the WSD task. Moreover, our models outperform previous models on all of the four datasets and PoS types. Introducing sense definitions can significantly improve the performance of the Bert model. The F1-score of Bert is better than Bert on almost all datasets and PoS types except on the adverbs, which reveals the efficiency of introducing sense definitions.

3.4 Discussion

Word Count 0 1-10 11-50 51-200 200
Bert 82.15 73.19 74.04 71.36 68.20
Bert 90.15 79.94 75.56 73.57 67.65
#Words 650 1406 1976 1854 912
Ambiguity 1.59 2.75 4.65 7.15 12.23
Table 3: F1-score(%) on words with different occurrence numbers in the training dataset.

Word Frequency

We compare the performance of Bert and Bert on words with different occurrence numbers in the training dataset. Table 3 shows the results. Compared with the Bert model, Bert achieves the largest performance improvement on unseen words ( F1-score). The improvement decreases as the word becomes frequent because the Bert model can be trained better with more instances. So we can conclude that utilizing sense definitions can enhance the model on infrequent polysemes. In addition, both models have better performance on low-frequency words. The reason is that high-frequency words usually have many senses.

Ablation Study

We compare two variations of the BERT encoder: the way of merging hidden states of polysemes, and whether concatenating the hidden states of polysemes with the sentence vector. The results of these variations are presented in Table 4. All of the models are based on Bert. We can find that the performance of using average or max operation is nearly the same for Bert. But Concatenating the sentence vector will impair the F1-score. The reason is: the sentence vector contains the global semantic information of the entire sentence. The sense of polyseme is determined by its local context. So using the sentence vector brings too much irrelevant information.

SE2 SE3 SE13 SE15 All
Mean 76.4 74.9 76.3 78.3 76.3
Max 76.3 75.1 76.6 78.4 76.4
Mean 74.2 73.8 76.2 78.1 75.2
Max 73.4 73.7 76.1 76.7 74.8
Table 4: Ablation Study. Mean and Max denotes the average and max operation for merging hidden states of polysemes. Concat means concatenating the hidden states of polysemes with the sentence vector.

4 Conclusion

In this paper, we fine-tune the pre-trained language models like BERT on WSD tasks for the first time. We find that our BERT-based models achieve the state-of-the-art results on the standard evaluation. We also utilize sense definitions to enhance the model on infrequent polysemes. In the future works, we will consider using the relations between senses, like hypernym and hyponym, to provide more accurate sense representations.


  • E. Agirre and A. Soroa (2009) Personalizing pagerank for word sense disambiguation. In Proceedings of EACL, Cited by: §1.
  • P. Basile, A. Caputo, and G. Semeraro (2014) An enhanced lesk word sense disambiguation algorithm through a distributional semantic model. In Proceedings of COLING, Cited by: §1, §3.2.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) Bert: pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL, Cited by: §1, §2.1.
  • P. Edmonds and S. Cotton (2001) SENSEVAL-2: overview. In Proceedings of SENSEVAL, Cited by: §3.1.
  • I. Iacobacci, M. T. Pilehvar, and R. Navigli (2016) Embeddings for word sense disambiguation: an evaluation study. In Proceedings of ACL, Cited by: §1, §3.2.
  • M. Kågebäck and H. Salomonsson (2016) Word sense disambiguation using a bidirectional lstm. In

    Proceedings of the Workshop on Cognitive Aspects of the Lexicon

    Cited by: §3.2.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §3.2.
  • M. Lesk (1986) Automatic sense disambiguation using machine readable dictionaries: how to tell a pine cone from an ice cream cone. In Proceedings of the International Conference on Systems Documentation, Cited by: §1.
  • F. Luo, T. Liu, Z. He, Q. Xia, Z. Sui, and B. Chang (2018a) Leveraging gloss knowledge in neural word sense disambiguation by hierarchical co-attention. In Proceedings of EMNLP, Cited by: §1, §3.2.
  • F. Luo, T. Liu, Q. Xia, B. Chang, and Z. Sui (2018b) Incorporating glosses into neural word sense disambiguation. In Proceedings of ACL, Cited by: §1, §3.2.
  • O. Melamud, J. Goldberger, and I. Dagan (2016) Context2vec: learning generic context embedding with bidirectional lstm. In Proceedings of SIGNLL, Cited by: §1.
  • G. A. Miller, M. Chodorow, S. Landes, C. Leacock, and R. G. Thomas (1994) Using a semantic concordance for sense identification. In Proceedings of the workshop on Human Language Technology, Cited by: §3.1.
  • G. A. Miller (1995) WordNet: a lexical database for english. Communications of the ACM, pp. 39–41. Cited by: §3.1.
  • A. Moro and R. Navigli (2015) Semeval-2015 task 13: multilingual all-words sense disambiguation and entity linking. In Proceedings of SemEval, Cited by: §3.1.
  • A. Moro, A. Raganato, and R. Navigli (2014) Entity linking meets word sense disambiguation: a unified approach. Transactions of the Association for Computational Linguistics 2, pp. 231–244. Cited by: §1, §3.2.
  • R. Navigli, D. Jurgens, and D. Vannella (2013) Semeval-2013 task 12: multilingual word sense disambiguation. In Proceedings of SemEval, Cited by: §3.1.
  • S. Neale, L. Gomes, E. Agirre, O. L. de Lacalle, and A. Branco (2016) Word sense-aware machine translation: including senses as contextual features for improved translation models.. In Proceedings of LREC, Cited by: §1.
  • M. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer (2018) Deep contextualized word representations. In Proceedings of NAACL, Cited by: §1.
  • S. Pradhan, E. Loper, D. Dligach, and M. Palmer (2007) Semeval-2007 task-17: english lexical sample, srl and all words. In Proceedings of SemEval, Cited by: §3.1.
  • A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever (2018) Improving language understanding by generative pre-training. Cited by: §1.
  • A. Raganato, C. D. Bovi, and R. Navigli (2017a) Neural sequence learning models for word sense disambiguation. In Proceedings of EMNLP, Cited by: §1, §3.2.
  • A. Raganato, J. Camacho-Collados, and R. Navigli (2017b) Word sense disambiguation: a unified evaluation framework and empirical comparison. In Proceedings of EACL, Cited by: §1, §3.1.
  • B. Snyder and M. Palmer (2004) The english all-words task. In Proceedings of SENSEVAL, Cited by: §3.1.
  • K. Taghipour and H. T. Ng (2015) One million sense-tagged instances for word sense disambiguation and induction. In Proceedings of CoNLL, Cited by: §3.1.
  • Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al. (2016)

    Google’s neural machine translation system: bridging the gap between human and machine translation

    arXiv preprint arXiv:1609.08144. Cited by: §2.1.
  • D. Yuan, J. Richardson, R. Doherty, C. Evans, and E. Altendorf (2016) Semi-supervised word sense disambiguation with neural models. In Proceedings of COLING, Cited by: §1.
  • Z. Zhong and H. T. Ng (2010) It makes sense: a wide-coverage word sense disambiguation system for free text. In Proceedings of ACL, Cited by: §1, §3.2.
  • Z. Zhong and H. T. Ng (2012) Word sense disambiguation improves information retrieval. In Proceedings of ACL, Cited by: §1.