WiC-TSV: An Evaluation Benchmark for Target Sense Verification of Words in Context

In this paper, we present WiC-TSV (Target Sense Verification for Words in Context), a new multi-domain evaluation benchmark for Word Sense Disambiguation (WSD) and Entity Linking (EL). Our benchmark differs from conventional WSD and EL benchmarks in being independent of a general sense inventory, which makes it highly flexible for the evaluation of a diverse set of models and systems in different domains. WiC-TSV is split into three tasks: systems get hypernymy information, definitional information, or both about the target sense. Test data is available in four domains: general (WordNet), computer science, cocktails and medical concepts. Results show that existing state-of-the-art language models such as BERT can achieve high performance on both in-domain and out-of-domain data, but they still have room for improvement. WiC-TSV task data is available at <https://competitions.codalab.org/competitions/23683>.




1 Introduction

Word Sense Disambiguation (WSD) is a long-standing task in Natural Language Processing and Artificial Intelligence. While progress has been made in recent years, mainly thanks to the surge of transformer-based language models such as BERT (Loureiro and Jorge, 2019; Vial et al., 2019; Huang et al., 2019), the evaluation of WSD models has been limited to a set of (mostly SemEval-based) standard WSD datasets (Raganato et al., 2017; Vial et al., 2018). These datasets usually come in one of two forms: lexical sample (in which a target word is placed in various contexts, triggering different meanings) and all-words (in which all the content words in a given text are to be disambiguated). However, these settings come with a major restriction: word senses in the datasets are linked to external sense inventories, usually WordNet (Fellbaum, 1998) in the case of WSD, or Wikipedia in the case of Entity Linking or Wikification. Therefore, the application of the framework is limited to those WSD systems in which sense distinctions are defined according to the underlying general sense inventory. This might not hold for domain-based WSD systems, which are designed for constrained settings. It also carries the disadvantage that WordNet is limited in its coverage, missing many novel usages and domain-specific terms (the last update in WordNet dates back to June 2011). The datasets are also not suitable for the evaluation of unsupervised or end-to-end disambiguation models that may induce their sense distinctions without resorting to any external inventory. These models need to carry out an extra, non-optimal step of mapping all induced senses to entries in the reference sense inventory (be it WordNet, Wikipedia or other) in order to facilitate their evaluation on standard benchmarks.

We propose a new evaluation benchmark for Word Sense Disambiguation systems and contextualised word representation models. The benchmark draws ideas from the Word-in-Context benchmark (Pilehvar and Camacho-Collados, 2019, WiC), but provides a different evaluation setting with additional flavors. WiC-TSV has some key differences with WiC: (1) it provides a more realistic WSD setting in which a target ambiguous word is compared against its entry in an ontology (and not against another usage of the word in a different context); (2) the task is more targeted at word-level representation, as in one of the tasks (i.e. hypernymy task) the model is not provided with any contextual information and, therefore, needs to have a clear understanding of the word to be able to make correct judgements; and (3) it provides additional test sets for a constrained evaluation in realistic domain-specific settings. WiC-TSV also inherits some of the desirable properties of WiC, such as independence from external sense inventories and binary classification nature of the task.

The task statement of WiC-TSV resembles the usage of enterprise knowledge graphs (Galkin et al., 2017) for entity linking. Typically, small domain-specific enterprise knowledge graphs only contain entities from the domain of interest, partially or completely missing the general purpose senses of the contained labels. Therefore, it is important to tackle the entity linking task assuming incomplete information, i.e. only having information about a single sense. An automatic system able to efficiently solve the WiC-TSV challenge is expected to be useful for leveraging domain-specific vocabularies for tagging diverse corpora coming from different sources. Such a system could be used for collecting and tagging large amounts of textual data, for example from social media, news agencies and blogs, for further analysis tasks such as sentiment analysis and relation extraction. All of these scenarios are highly relevant for business in the current information age.

2 Related Work

Word Sense Disambiguation and Entity Linking

The task of Word Sense Disambiguation (WSD) consists of associating a word in context with its most appropriate entry in a given sense inventory (e.g. WordNet). Similarly, in Entity Linking (EL) a system is generally required to identify, for an entity given a context, its most appropriate entry in a knowledge base or entity inventory (e.g. Wikipedia). For both WSD and EL there are many associated standard datasets (Röder et al., 2018; Pasini and Camacho-Collados, 2020). The main difference between our proposed benchmark and WSD/EL is that in our task there is no standard sense inventory that systems need to model in full. Each instance in our dataset is associated with a target word and a single sense, and therefore systems are not required to model all senses of the target word, but rather only a single sense. This facilitates the development of systems for specific domains or settings, as no general-domain knowledge resource is required to perform this task. For instance, an Indonesian company may want to retrieve all sentences referring to the island of Java and not to other, unrelated senses. This framing of the task is frequent in business and data mining settings where domain-specific knowledge resources or inventories may be available, without the need for modeling instances from other domains.

Word Sense Induction

Two tasks on word sense induction were offered as a part of SemEval: SemEval 2010 Task 14 (Manandhar et al., 2010) and SemEval 2013 Task 13 (Jurgens and Klapaftis, 2013). These tasks offered an evaluation benchmark for systems that are able to induce sense inventories from unannotated text. In both tasks a large training dataset was provided for systems to induce the unknown senses. After the induction step the systems tag each occurrence of the targets with an induced sense or senses. These datasets are similar to our proposed benchmark in that no explicit sense inventory is required beforehand. However, in our challenge not all the targets appear in the training set, so the annotation system cannot train one classifier per target. The system should be flexible and correctly classify the usages of unseen targets.


The closest task to ours is probably Word-in-Context (Pilehvar and Camacho-Collados, 2019, WiC), on which we base our WiC-TSV dataset. WiC (https://pilehvar.github.io/wic/) is a binary classification dataset where a target word is presented with two different contexts. The task consists of deciding whether the word is associated with the same sense in the two contexts or not. WordNet examples were used as the basis for the construction of this dataset. WiC is also one of the tasks included in the general language understanding framework SuperGLUE (https://super.gluebenchmark.com/; Wang et al., 2019). The main difference with respect to our dataset lies in the presence of relevant information such as hypernyms and definitions, which makes our dataset more realistic and a direct proxy for downstream evaluation: in WiC-TSV a single word is presented with its context and relevant information, in contrast to the two usages of the same word included in the original WiC dataset. Moreover, WiC-TSV includes domain-specific test sets (e.g. cocktails and medical entities), which makes the benchmark more challenging and comparable to a real setting.

3 WiC-TSV: The Benchmark

A goal of this benchmark is to enable the usage of domain-specific enterprise knowledge graphs for the processing of large volumes of diverse textual data. To this end, we constructed a benchmark satisfying the following requirements:

  1. Only a single sense of the target label is known;

  2. An enterprise knowledge graph is dynamic, i.e. new entities might be added and it might be necessary to classify usages of previously unseen words;

  3. Even domain specific knowledge graphs often contain certain general purpose entities, so it is necessary to disambiguate both general purpose and domain specific senses;

  4. Usually definitions, hypernyms or class assertions, possibly synonyms, are contained in the enterprise knowledge graph, therefore this information can be used for disambiguation.

Another model quality that the presented benchmark aims to measure is the ability to transfer intrinsic knowledge to a specific domain. For most areas, domain-specific training data is hard to obtain, so being able to learn from general purpose data and still perform well on domain-specific data is a huge advantage in a real-world setting.

In order to address the aforementioned goals, our benchmark consists of a general purpose training and development set, and a test set which also contains domain specific examples. Development and test splits include both seen and unseen target words, simulating real-world scenarios where new entities might be added to the knowledge graph after the model was trained. Each instance consists of a target word with a corresponding target sense, represented by either its definition (Task 1), its hypernym/s (Task 2), or both definition and hypernym/s (Task 3), and a context containing the target word. The task aims to determine whether the meaning of the word used in the context matches the target sense.
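The instance layout and the three task variants described above can be sketched as a small data structure. This is only an illustration: the class name, field names and the helper `sense_description` are assumptions made here and do not reflect the official file format of the benchmark. The example values are taken from Table 1.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TSVInstance:
    """One target-sense-verification example (illustrative field names)."""
    target: str            # target word
    context: str           # text containing the target word
    definition: str        # definition of the target sense
    hypernyms: List[str]   # hypernym(s) of the target sense
    label: bool            # True if the usage matches the target sense


def sense_description(inst: TSVInstance, task: int) -> str:
    """Return the sense information a system is given in each task."""
    hypernyms = ", ".join(inst.hypernyms)
    if task == 1:          # Task 1: definition only
        return inst.definition
    if task == 2:          # Task 2: hypernym(s) only
        return hypernyms
    if task == 3:          # Task 3: definition and hypernym(s)
        return f"{inst.definition} {hypernyms}"
    raise ValueError(f"unknown task: {task}")


example = TSVInstance(
    target="sewer",
    context="all that work went down the sewer",
    definition="someone who sews",
    hypernyms=["needleworker"],
    label=False,
)
print(sense_description(example, 3))  # someone who sews needleworker
```

In every task the system sees the same context and must produce the same binary label; only the description of the target sense changes.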

Table 1 contains examples of instances for all subsets available in the WiC-TSV test set. All WiC-TSV data and information on how to submit test results is available in CodaLab (competitions.codalab.org/competitions/23683). Furthermore, a small sample of 10 entities is available online in the form of a survey (https://www.surveymonkey.com/r/LHYWXPV), where the achieved score is shown to the user after the submission.

General Purpose (WNT/WKT)

  Tag: T
  Target: Smoking
  Context: Smoking is permitted .
  Definition: the act of smoking tobacco or other substances
  Hypernyms: breathing, external respiration, respiration, ventilation

  Tag: F
  Target: sewer
  Context: all that work went down the sewer
  Definition: someone who sews
  Hypernyms: needleworker

Cocktails (CTL)

  Tag: T
  Target: Bellini
  Context: We were 11 at table for this feast . We started the evening with Bellini , made with fresh , Niagara peaches . ( Thank you , Jack Lalanne Juicer ! )
  Definition: A Bellini cocktail is a mixture of Prosecco sparkling wine and peach purée. Originating in Venice, it is one of Italy’s most popular long drinks.
  Hypernyms: cocktail

  Tag: F
  Target: Bellini
  Context: After a morning ’s work I went off to see the Bellini retrospective at the Quirinale . Beautiful !
  Definition: A Bellini cocktail is a mixture of Prosecco sparkling wine and peach purée. Originating in Venice, it is one of Italy’s most popular long drinks.
  Hypernyms: cocktail

Medical Subjects (MSH)

  Tag: T
  Target: corona
  Context: Italy now reports the second highest number of corona cases wordlwide .
  Definition: A viral disorder characterized by high fever; cough; dyspnea; renal dysfunction and other symptoms of a viral pneumonia. A coronavirus sars-CoV-2 in the genus betacoronavirus is the suspected agent.
  Hypernyms: pneumonia; viral_pneumonia; coronavirus_infection

  Tag: F
  Target: Corona
  Context: Corona Labs is happy to announce the general availability of the public beta of Android 64-bit Corona builds .
  Definition: A viral disorder characterized by high fever; cough; dyspnea; renal dysfunction and other symptoms of a viral pneumonia. A coronavirus sars-CoV-2 in the genus betacoronavirus is the suspected agent.
  Hypernyms: pneumonia; viral_pneumonia; coronavirus_infection

Computer Science (CPS)

  Tag: T
  Target: Python
  Context: pandas is a fast , powerful , flexible and easy to use open source data analysis and manipulation tool , built on top of the Python programming language .
  Definition: Python is an interpreted, high-level, general-purpose programming language
  Hypernyms: object_oriented_programming_language

  Tag: F
  Target: pythons
  Context: The present paper compares the recently studied pythons with those examined 20 years ago , and uses the combined dataset to assess the ecological sustainability .
  Definition: Python is an interpreted, high-level, general-purpose programming language
  Hypernyms: object_oriented_programming_language

Table 1: Sample instances from the four test subsets of WiC-TSV. The target word of each instance is given in the Target field. Tags: T (True) and F (False).

3.1 Dataset Construction

In this section we detail the construction of the datasets. First, we describe the construction of the training and development set (Section 3.1.1) and then the test set, with a special focus on the creation of the domain-specific sub-sets (Section 3.1.2).

3.1.1 Training and Development Set

Examples in the training and development set do not focus on a specific domain. They are based on the Word-in-Context (WiC) dataset (Pilehvar and Camacho-Collados, 2019), in which each instance contains a target word and two contexts. The contexts from WiC for noun instances come from two resources, WordNet (our base resource) and Wiktionary. To maintain the desirable characteristics of the WiC dataset (e.g. balanced data, not having repeated contextual sentences across instances), the splits of the original training and development sets were treated separately in the following way. Starting from a noun-only sub-sample, for each context, the conceptual meaning of the target word in context was mapped to the corresponding synset of WordNet, adding a sense identifier. The instances were then transformed to provide only one context each, by splitting the examples by context. For initially negative examples (i.e. the word has different meanings in the two contexts), the sense identifiers were switched. To avoid information leakage, only one of the two resulting instances was kept. Finally, for each sense, the definition and hypernym/s (both derived from WordNet) were provided as the relevant information about the target word for disambiguation.
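The conversion from a WiC pair to a single-context instance can be illustrated roughly as follows. This is a sketch under stated assumptions, not the authors' script: the record keys, the sense identifiers and the random choice of which instance to keep are all hypothetical, since the text does not specify them.

```python
import random
from typing import Dict, Optional


def wic_to_tsv(wic: Dict, rng: Optional[random.Random] = None) -> Dict:
    """Turn one sense-annotated WiC pair into one target-sense instance.

    `wic` is a hypothetical record with the keys target, context1,
    context2, sense1, sense2 (sense identifier of the target in each
    context) and same_sense (the original WiC label).
    """
    rng = rng or random.Random(0)
    contexts = [wic["context1"], wic["context2"]]
    senses = [wic["sense1"], wic["sense2"]]
    if not wic["same_sense"]:
        # Negative pair: switch the identifiers, so each context is
        # paired with the sense used in the other context.
        senses.reverse()
    # Keep only one of the two candidate instances to avoid leakage.
    # Choosing at random is an assumption made for this sketch.
    keep = rng.randrange(2)
    return {
        "target": wic["target"],
        "context": contexts[keep],
        "target_sense": senses[keep],
        "label": wic["same_sense"],
    }


pair = {
    "target": "bank",
    "context1": "He sat on the bank of the river.",
    "context2": "She deposited money at the bank.",
    "sense1": "bank#river",          # made-up identifiers
    "sense2": "bank#institution",
    "same_sense": False,
}
instance = wic_to_tsv(pair)
print(instance["label"], instance["context"], instance["target_sense"])
```

Under this reading, a positive WiC pair yields a positive instance and a negative pair yields a negative one, which would preserve the class balance of the original data.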

3.1.2 Test Sets

To make the dataset more challenging and realistic, the test set incorporates both general purpose and domain-specific examples.

General Purpose (WNT/WKT)

The general purpose examples were generated analogously to those in Section 3.1.1. Hence, this test set is composed of both WordNet and Wiktionary examples.

In the following we describe the construction of the domain-specific test sets, which started by choosing appropriate target words that could occur as concepts in an enterprise knowledge graph.

Cocktails (CTL)

For the cocktails examples the target words were taken from the “All about cocktails” thesaurus (vocabulary.semantic-web.at/cocktails, visited on 05.03.2019). The thesaurus contains 300 concepts describing not only cocktails, but also beverages, garnishes and glassware, among others. We have only used ambiguous cocktail names for this experiment. In this case, the hypernym representing the target sense for each cocktail name is “cocktail”, while the definition is derived from the thesaurus.

Medical Subjects (MSH)

For medical subject examples we use concepts, definitions and hypernyms from the MeSH thesaurus (www.nlm.nih.gov/mesh/, visited on 27.03.2020). This thesaurus contains medical entities. We considered names of different types of entities such as diseases, symptoms and body parts as target words.

Computer Science (CPS)

Target words in the domain of computer science were gathered manually, without a readily available thesaurus. The definitions were derived using the lead section of the corresponding Wikipedia page, while hypernyms were created by the consensus of two domain experts.

In order to collect the context usages of the ambiguous words for CTL and MSH we used the Wikilinks dataset Singh et al. (2012). This dataset contains documents – webpages scraped from the web, including many blog posts – and the links from these documents to the Wikipedia pages. We collected the documents that mention the target words in different meanings. We removed all the documents that refer to the disambiguation page at Wikipedia and all the target words that were not mentioned with their domain-specific target sense, i.e. either as a cocktail, as a medical entity or as a computer science entity. Then, we identified the occurrences of the target words in the collected documents and extracted the local context around each occurrence. All contexts and definitions from the target sense of each instance are tokenized.

The contexts for CPS were collected manually. As the first step a list of ambiguous words was fixed. Then a search engine was used to find contexts for each ambiguous word. The senses were assigned manually.

                               Training                Development             Testing
                               Total   Unique  Pos.    Total   Unique  Pos.    Total   Unique  Pos.
General Purpose   WNT/WKT      2922    1551    0.51    440     427     0.50    717     698     0.54
Medical Subjects  MSH          -       -       -       -       -       -       205     8       0.52
Cocktails         CTL          -       -       -       -       -       -       216     9       0.43
Computer Science  CPS          -       -       -       -       -       -       168     8       0.46
All domains       MSH+CTL+CPS  -       -       -       -       -       -       589     25      0.47
Total                          2922    1551    0.51    440     427     0.50    1306    723     0.51

Table 2: Statistics of training, development and testing splits of WiC-TSV, including total number of instances (Total), number of unique target words (Unique) and proportion of positive examples (Pos.).

After pre-processing, the datasets were checked manually to remove non-suitable and unsolvable examples. To maintain a rather realistic evaluation setup, data was not completely cleaned, meaning that contexts can contain noisy elements such as headings or meta-info derived from the websites (e.g., “posted by:”). The main difference between domain-specific and WNT/WKT test sets is that in the former the target sense of target words does not change, but in WNT/WKT there might be multiple target senses. For instance, in WNT/WKT, there might be “iris” with multiple intended meanings, but in MSH there is only one target meaning considered.

3.2 Data Cleaning

While the quality of the domain specific examples is assured due to their manual creation process, an additional data cleaning step was introduced in which general purpose examples were manually curated. The examples from the test set were split into 4 sets with an overlap of 20%. Each set was evaluated by an annotator regarding correctness and solvability of the examples. For example, when the hypernym of an instance was too generic to help in the disambiguation process or the context itself was too ambiguous, the instance was marked as ’to filter out’. Each marked example was reviewed by a second annotator, who could either confirm or reject the request for removal. Examples were removed when all annotators who reviewed them agreed that their quality was too poor.

An example of such a removed instance is the context ’The zero sign in American Sign Language is considered rude in some cultures .’ for the target word ’zero’ with the target definition ’a mathematical element that when added to another number yields the same number’. ’Zero sign’ is used to describe the OK gesture, in which the ’zero’ refers to a ring formed with the thumb and pointing finger (which does not match the target sense), but on the other hand the ’zero sign’ also refers to the sign of the digit zero in American Sign Language (which does match the target sense). Other examples of filtered instances involve sentences where the target word may have been used metaphorically.

This procedure resulted in the removal of 106 examples. About 8% of these examples were part of the evaluation sets created to measure human performance (see Section 3.4; annotations for these examples were removed before calculating the metrics presented there). On these instances, native speakers achieved a mean accuracy of only 56%, and the performance of non-native speakers, at 44%, was even below chance. This shows that the data cleaning step was necessary in order to ensure the data quality of the test set.

3.3 Statistics

A statistical overview of the datasets and their splits is shown in Table 2. The 4668 available examples were split into train, development and test sets with a ratio of 63:9:28, which allows a detailed analysis of the generalisation capabilities of tested systems while still providing an appropriately sized training set.

The test set contains around 55% general purpose examples and 45% from specific domains. For each domain, the number of unique senses is relatively low compared to the general domain subset, which results in a higher number of examples per target sense. For all three splits, positive and negative examples are approximately balanced.
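The split ratio and the composition of the test set follow directly from the counts in Table 2. The short check below only recomputes them; the variable names are chosen here for illustration.

```python
# Instance counts per split, copied from Table 2.
totals = {"train": 2922, "dev": 440, "test": 1306}
n_all = sum(totals.values())
shares = {name: round(100 * n / n_all) for name, n in totals.items()}
print(n_all, shares)  # 4668 {'train': 63, 'dev': 9, 'test': 28}

# Composition of the test set: general purpose vs. domain-specific examples.
test_general, test_domain = 717, 589
general_share = round(100 * test_general / totals["test"])
domain_share = round(100 * test_domain / totals["test"])
print(general_share, domain_share)  # 55 45
```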

3.4 Human Performance and Inter-Annotator Agreement

To estimate the human performance upper bound, a sub-sample of the test set was manually annotated. The performance was evaluated on the setting of Task 3, meaning that both the definition and the hypernyms were provided to disambiguate. A random selection of 250 examples was split into two evaluation sets of 150 examples each, resulting in a 20% overlap. Each evaluation set was assigned to a non-expert annotator with English as native language. No additional information, in particular none from the respective ontology or about other senses of the target, was provided to the annotators, and they were instructed not to use external knowledge sources (e.g. if they were not familiar with the domain specific sense of a word).

Results of the human performance evaluation and inter-rater agreement can be found in Table 3. The mean performance for the evaluated datasets was 85%, with individual scores of 81% and 89%. To estimate the inter-annotator reliability, the agreement of the two annotators on the overlapping examples was calculated: for 42 examples (84%) the annotators agreed on the label.
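The overlap and agreement figures can be reproduced with simple arithmetic, assuming the 20% overlap is measured relative to the 250 selected examples (an interpretation made here, consistent with the reported 42 of 50 agreement).

```python
n_selected = 250   # randomly selected test examples
set_size = 150     # size of each of the two evaluation sets
agreed = 42        # overlapping examples with identical labels

# Two sets of 150 drawn from 250 examples must share 50 examples.
overlap = 2 * set_size - n_selected
overlap_pct = 100 * overlap / n_selected
agreement_pct = 100 * agreed / overlap
mean_accuracy = (81 + 89) / 2   # mean of the two individual scores

print(overlap, overlap_pct, agreement_pct, mean_accuracy)  # 50 20.0 84.0 85.0
```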

When evaluating the examples per domain, it can be seen that the general purpose examples were more difficult than the domain specific ones, as annotators achieved an average performance of 82% (individual scores of 77% and 87%) on the general purpose instances, while the mean accuracy on the domains were 89% (83% and 96%), 92% (88% and 96%), and 86% (89% and 84%) for MSH, CTL, and CPS, respectively.

Influence of Native Language

In order to investigate the influence of English proficiency on the performance, a second evaluation with non-native speakers was performed. To this end, the test set was split up into the different domains, and overlapping subsets were created, containing about 100 examples for general purpose (6 sets in total) and 35-45 examples for the domains (2 sets each).

All outcomes are summarised in Table 3. It can be seen that while the mean accuracy achieved by native speakers and non-native speakers is comparable on the domain specific examples, their performance differs on the general purpose examples. While for both groups the average performance on the general purpose instances is worse than for other subsets, the performance difference is statistically significant only for non-natives. An explanation for this observation could be varying distances between the senses of a word. As the differences between two synsets in WordNet can be quite subtle, or fine-grained, expert knowledge or access to all different senses could be necessary in order to solve these examples. Instances from a specific domain, on the other hand, are usually more coarse-grained, reducing the necessity of expertise. Language knowledge seems to have an amplifying effect in this regard.

          Native  Non-Native
WNT/WKT   82.1    76.6
MSH       89.1    89.3
CTL       92.0    90.4
CPS       86.5    89.7
Total     85.3    82.3

Table 3: Comparison of the average accuracy of human annotators with (Native) and without (Non-Native) English as mother tongue. The performance was calculated for the general purpose subset (WNT/WKT) and the domain specific subsets (MSH, CTL, CPS).

4 Experimental Results

In this section we evaluate the performance of standard models on our WiC-TSV benchmark. For our experiments we considered two main baseline systems, namely BERT (Devlin et al., 2019) and FastText (Joulin et al., 2017). Upon acceptance we will release the code in order to facilitate replication of the results and future comparison to our baselines. Each of the baselines was adapted to the corresponding tasks in WiC-TSV.

4.1 Evaluation Tasks

The benchmark is split into three different tasks that we detail below: definition-based (Section 4.1.1), hypernym-based (Section 4.1.2) and both (Section 4.1.3).

4.1.1 Task 1: Definition Information

In this task, the goal is to identify whether the intended meaning of the target word in the corresponding context matches that described by the definition (cf. Table 1 for examples). In other words, the model has to check if the concept represented by the definition can fit within the given context sentence. For this task, the system is provided with a context sentence (in which the target word is marked) along with a definition sentence (which describes one of the possible meanings of the word).


The first baseline is based on the pre-trained transformer-based language model BERT (we used the implementation of BERT available at https://github.com/CyberZHG/keras-bert with the base pretrained model). It consists of a simple classification layer on top of the BERT model, which is responsible for encoding the input. For this task, we concatenate the context and the definition sentences together and feed the whole sequence to BERT. Then, the classification layer takes as input the concatenation of three different vectors, all provided by BERT: the [CLS] token representation, the representation of the target word in the context sentence and the average representation of the words in the definition sentence. This is similar to the baseline BERT model employed in SuperGLUE (Wang et al., 2019). It is worth mentioning that BERT is originally trained using WordPiece tokenization (Wu et al., 2016), which means that each word can be broken down into more than one sub-word. Therefore, in order to have a fixed length representation for each word, we take the average of its sub-word representations. Finally, the whole model is fine-tuned on the training set.

For the FastText-based baseline, we first extract the corresponding embeddings for each word in the context and definition sentences. Then, each sentence representation is simply computed as the average of the embeddings of the words it contains. Next, these two sentence representations are concatenated together to form a fixed length vector which we then feed to a fully connected layer. Finally, we put a simple classification layer on top of this fully connected layer and train the model on the training set.
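How the input to the classification layer is assembled from the encoder output can be sketched in plain Python. This is a toy illustration rather than the released baseline: the encoder output is replaced by hand-written vectors, the [CLS] vector is assumed to sit at position 0, and the function and parameter names are invented for this sketch.

```python
from typing import List, Sequence, Tuple

Vector = List[float]
Span = Tuple[int, int]  # [start, end) positions in the sub-word sequence


def mean_vector(vectors: Sequence[Vector]) -> Vector:
    """Element-wise average of equally sized vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def word_vector(hidden: Sequence[Vector], span: Span) -> Vector:
    """Fixed-length word representation: average of its sub-word vectors."""
    start, end = span
    return mean_vector(hidden[start:end])


def classifier_input(hidden: Sequence[Vector], target_span: Span,
                     definition_spans: Sequence[Span]) -> Vector:
    """Concatenate the [CLS], target-word and definition representations."""
    cls_vec = list(hidden[0])  # assumes [CLS] is the first token
    target_vec = word_vector(hidden, target_span)
    definition_vec = mean_vector(
        [word_vector(hidden, span) for span in definition_spans])
    return cls_vec + target_vec + definition_vec


# Toy stand-in for encoder output: 6 sub-word tokens, hidden size 2.
hidden = [[1.0, 0.0], [0.0, 2.0], [2.0, 4.0],
          [9.0, 9.0], [4.0, 0.0], [0.0, 4.0]]
# Target word split into two sub-words (positions 1-2); two definition words.
features = classifier_input(hidden, target_span=(1, 3),
                            definition_spans=[(4, 5), (5, 6)])
print(features)  # [1.0, 0.0, 1.0, 3.0, 2.0, 2.0]
```

With a real encoder, the resulting vector would have three times the hidden size and would be passed to the classification layer during fine-tuning.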

4.1.2 Task 2: Hypernym Information

For this task, the system is provided with a target word (in a context) and a set of hypernyms for one of its senses. The task is to identify if the triggered meaning is the hyponym of the provided hypernym. In other words, the goal is to check if the intended meaning is the one characterized by the corresponding hypernym. Note that, unlike Task 1, no definition sentence is involved in this setting and the task is directed only on the merits of the hypernymy information.


We used very similar baseline models to those used in the previous task. The only difference lies in how we shape the inputs fed to these models. For the BERT-based model, we put together the context sentence with the hypernyms to form the input. Similarly, for the FastText-based model, the hypernym’s embedding is concatenated with the context’s representation and fed to the classifier.

4.1.3 Task 3: Both Definition and Hypernym Information

In the third and final task, systems are provided with both definition and hypernymy information. This task resembles a setting where both sources of information are available and combining them is therefore desirable.


For this task, we concatenate the definition sentences and the hypernyms together, and feed the generated sequence together with the context sentence to BERT. Then, the concatenation of the [CLS] token representation, the representation of the target word in the context sentence and the average representation of the words in the definition/hypernyms sequence is fed to the classification layer. For the FastText-based baseline, the hypernym’s embedding is concatenated with both the context’s representation and the definition sentence representation and the combination is fed to the MLP classifier with a single hidden layer.
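The embedding-averaging baseline for Task 3 can be illustrated with a toy forward pass. Everything concrete below is an assumption for illustration: the two-dimensional embedding table and weights are made up, averaging is used when several hypernyms are given, and the ReLU and sigmoid activations are choices made here, since the text only specifies an MLP with a single hidden layer.

```python
import math
from typing import Dict, List

Vector = List[float]


def average(vectors: List[Vector]) -> Vector:
    """Element-wise average of equally sized vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def embed_text(tokens: List[str], table: Dict[str, Vector], dim: int) -> Vector:
    """Average the embeddings of known tokens (zero vector if none is known)."""
    known = [table[token] for token in tokens if token in table]
    return average(known) if known else [0.0] * dim


def mlp_forward(x: Vector, w_hidden: List[Vector], b_hidden: Vector,
                w_out: Vector, b_out: float) -> float:
    """One hidden layer (ReLU) followed by a sigmoid output unit."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w_hidden, b_hidden)]
    logit = sum(w * h for w, h in zip(w_out, hidden)) + b_out
    return 1.0 / (1.0 + math.exp(-logit))


# Made-up 2-dimensional embeddings standing in for pretrained vectors.
table = {"bellini": [1.0, 0.0], "peach": [0.0, 1.0],
         "wine": [1.0, 1.0], "cocktail": [0.5, 0.5]}
context = embed_text(["bellini", "peach"], table, dim=2)
definition = embed_text(["wine", "peach"], table, dim=2)
hypernym = embed_text(["cocktail"], table, dim=2)
x = context + definition + hypernym   # concatenated input of length 6

# Untrained toy weights: 2 hidden units over the 6 input features.
w_hidden = [[0.1] * 6, [-0.1] * 6]
score = mlp_forward(x, w_hidden, [0.0, 0.0], w_out=[1.0, 1.0], b_out=0.0)
print(len(x), 0.0 < score < 1.0)  # 6 True
```

For Tasks 1 and 2 the same sketch applies with the hypernym or the definition part of the input left out, respectively.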

4.2 Results

                  Acc   Prec  Rec    F1
Task-1  BERT      75.3  71.7  84.9   77.7
        FastText  53.7  54.1  57.6   55.7
Task-2  BERT      71.4  67.7  83.5   74.8
        FastText  52.7  52.4  73.6   61.1
Task-3  BERT      76.6  74.1  82.8   78.2
        FastText  53.4  52.8  79.4   63.4
BaselineTrue      50.8  50.8  100    67.3
Human             85.3  80.2  96.2   87.4

Table 4: Performance (mean of three runs) of the two baseline models on the WiC-TSV test set, in terms of accuracy, precision, recall, and F1. Results are shown for three tasks: Task-1 (definition-based), Task-2 (hypernym-based), and Task-3 (both sources of information). BaselineTrue is a naive baseline that always returns “True” and the human performance is computed as described in Section 3.4.

                  WNT/WKT                  | CTL                      | MSH                      | CPS
                  Acc   P     R      F1    | Acc   P     R      F1    | Acc   P     R      F1    | Acc   P     R      F1
T1  BERT          73.3  74.0  77.7   75.8  | 76.2  65.1  98.9   78.4  | 77.6  73.4  89.0   80.4  | 80.0  70.5  97.9   81.9
    FastText      56.2  58.9  61.9   60.3  | 49.8  39.0  30.8   34.3  | 51.7  52.2  79.2   62.9  | 50.4  45.6  38.5   41.6
T2  BERT          68.6  70.0  72.9   71.4  | 77.9  66.6  97.8   79.3  | 71.9  65.1  98.4   78.3  | 74.4  64.7  98.7   78.2
    FastText      56.8  58.9  66.3   62.1  | 43.1  43.0  99.3   60.0  | 49.1  50.4  84.0   62.9  | 52.0  48.8  65.0   55.3
T3  BERT          73.5  76.1  74.2   75.1  | 79.2  67.8  98.2   80.2  | 79.8  75.8  89.6   82.1  | 82.1  73.0  97.9   83.6
    FastText      57.1  58.0  74.0   65.0  | 43.1  43.1  100.0  60.2  | 51.1  51.5  90.3   65.6  | 54.0  50.5  67.1   57.3
BaselineTrue      53.8  53.8  100    70.0  | 43.1  43.1  100    60.2  | 51.7  51.7  100    68.2  | 46.4  46.4  100    63.4

Table 5: Performance (mean of three runs) for the two baseline models for the three tasks (i.e., T1: definition-based, T2: hypernymy-based, and T3: both sources of information). Results are reported for the four domains: General (WNT/WKT), Cocktails (CTL), Medical Subjects (MSH), and Computer Science (CPS). BaselineTrue is a naive baseline that always returns “True”. Human performance in terms of accuracy is estimated to be 82.1% (WNT/WKT), 92.0% (CTL), 89.1% (MSH) and 86.5% (CPS) as described in Section 3.4.

Table 4 shows the overall results for the three tasks. As can be observed, BERT is clearly better than FastText on all measures. In fact, perhaps surprisingly, FastText does not perform better than a naive baseline that labels every instance as true. This also reinforces the challenging nature of the benchmark, as even BERT is far from human annotator performance (estimated at 85.3% accuracy). Clearly, the definition information is more helpful to BERT than the hypernyms, while the combination of both attains the best overall results.
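To make the comparison with the naive baseline concrete, the following minimal sketch computes the four reported measures for a system that always answers “True”. The function name `binary_metrics` and the toy label set are assumptions made for illustration only: the 1000-instance size is hypothetical, and only the 53.8% positive rate is taken from the WNT/WKT BaselineTrue entry in Table 5.

```python
from typing import Dict, Sequence


def binary_metrics(gold: Sequence[bool], pred: Sequence[bool]) -> Dict[str, float]:
    """Accuracy, precision, recall and F1 (in %) for binary labels."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    tp = sum(1 for g, p in zip(gold, pred) if g and p)
    fp = sum(1 for g, p in zip(gold, pred) if not g and p)
    fn = sum(1 for g, p in zip(gold, pred) if g and not p)
    tn = sum(1 for g, p in zip(gold, pred) if not g and not p)

    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)

    return {
        "accuracy": 100 * accuracy,
        "precision": 100 * precision,
        "recall": 100 * recall,
        "f1": 100 * f1,
    }


# Hypothetical toy labels: 1000 instances, 53.8% of them positive.
gold = [True] * 538 + [False] * 462

# A BaselineTrue-style system predicts "True" for every instance.
pred = [True] * len(gold)

scores = binary_metrics(gold, pred)
print({name: round(value, 1) for name, value in scores.items()})
```

For such a system, accuracy and precision both equal the proportion of positive instances and recall is 100%, so its F1 depends only on the class balance of the test set.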

The high recall of BERT, in contrast to its precision, is also remarkable. This is mainly attributable to the domain-specific datasets, as analyzed below.

Domain-Specific Results.

Table 5 presents the results split by domain. Interestingly, FastText faces a massive challenge in adapting to new domains and generalising from the general to the specific domains. BERT, however, proves to be much more robust to domain changes. In fact, perhaps surprisingly, the results on the domain-specific datasets do not drop substantially with respect to the WNT/WKT test set, even though the training and development instances came from the same source as that test set (i.e. WordNet and Wiktionary). This can be attributed to the fact that specific domains highly constrain the set of possible senses for a word, resulting in an easier WSD classification task Magnini et al. (2002). WordNet, on the other hand, is known to be quite fine-grained (e.g., the noun run has 16 different senses in WordNet, in addition to its many senses as a verb).

In general, the BERT model attains a very high recall on the domain-specific datasets, while its precision remains reasonably high. Such a model can be helpful in a retrieval setting where recall matters most, for example when the output is passed on to a human who can filter it.

5 Conclusion

In this paper we have introduced WiC-TSV, a word sense disambiguation benchmark which differs from previous disambiguation datasets in two main ways: (1) it is framed as a binary classification task where only one target sense needs to be identified, and (2) modelling a general sense inventory is not required. Our benchmark therefore opens the floor for different disambiguation algorithms that do not require modelling the entirety of a sense inventory. Moreover, the high human performance in the task (i.e. 85.3% on average for native speakers) contrasts with the relatively low inter-annotator agreement in WSD datasets, where the IAA ceiling is estimated to be around 80% Navigli (2009).
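The binary framing can be sketched as a small data structure together with the interface a system has to implement. Everything in this sketch is an assumption made for illustration: the class `TSVInstance`, its field names, the helper `sense_description` and the example sentence are hypothetical and do not reflect the format of the released data. The point is only that a system sees one context and one candidate sense, never a full sense inventory.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TSVInstance:
    """One target sense verification instance (hypothetical field names)."""
    context: str                       # sentence containing the target word
    target_word: str                   # the ambiguous word to verify
    definition: str                    # definition of the candidate sense
    hypernyms: List[str] = field(default_factory=list)
    label: Optional[bool] = None       # True if the word is used in this sense


def sense_description(instance: TSVInstance, task: int) -> str:
    """Return the sense information available to a system in each task."""
    hypernym_text = ", ".join(instance.hypernyms)
    if task == 1:   # definition only
        return instance.definition
    if task == 2:   # hypernyms only
        return hypernym_text
    if task == 3:   # both sources of information
        return f"{instance.definition} ({hypernym_text})"
    raise ValueError("task must be 1, 2 or 3")


def baseline_true(instance: TSVInstance, task: int) -> bool:
    """Naive verifier that always answers True, in the style of BaselineTrue."""
    return True


# Invented toy instance, not taken from the benchmark.
example = TSVInstance(
    context="She went for a run before breakfast.",
    target_word="run",
    definition="the act of running on foot at a fast pace",
    hypernyms=["locomotion"],
    label=True,
)

for task in (1, 2, 3):
    print(task, sense_description(example, task), baseline_true(example, task))
```

Any verifier with the same signature as `baseline_true` could be plugged in, whether it relies on a full sense inventory internally or only on the description of the single target sense.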

WiC-TSV also provides a crucial advantage in domain-specific settings: because a general inventory covering all senses is not required, it facilitates the development of systems that are aimed only at modelling the domain at hand. In this initial release, in addition to a more general setting based on WordNet and Wiktionary, three specific and heterogeneous domains are included: cocktails, medical subjects and computer science. The benchmark will therefore contribute to measuring progress in WSD in a more realistic setting, without being tied to a general sense inventory and while remaining flexible to different settings and domains.

In our initial experiments we found that current state-of-the-art disambiguation techniques based on pre-trained language models like BERT are very accurate at handling ambiguity, even in specialized domains, given enough training data. However, they still have room for improvement, as highlighted by the gap with respect to human performance. This benchmark can therefore open up avenues for future research on domain transfer and on developing general-purpose solutions which can perform well on a variety of domains without the need for large amounts of training data. Finally, as future work it would be interesting to test hybrid models which take both definitional and hypernymy information into account. In this paper we combined both sources in BERT in a simple manner, but more complex models should lead to further improvements.


  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4171–4186. Cited by: §4.
  • C. Fellbaum (Ed.) (1998) WordNet: an electronic lexical database. MIT Press, Cambridge, MA. Cited by: §1.
  • M. Galkin, S. Auer, M. E. Vidal, and S. Scerri (2017) Enterprise knowledge graphs: A semantic approach for knowledge management in the next generation of enterprise information systems. In ICEIS 2017 - Proceedings of the 19th International Conference on Enterprise Information Systems, Vol. 2, pp. 88–98. External Links: ISBN 9789897582486 Cited by: §1.
  • L. Huang, C. Sun, X. Qiu, and X. Huang (2019) GlossBERT: BERT for word sense disambiguation with gloss knowledge. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 3507–3512. Cited by: §1.
  • A. Joulin, E. Grave, P. Bojanowski, and T. Mikolov (2017) Bag of tricks for efficient text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pp. 427–431. Cited by: §4.
  • D. Jurgens and I. Klapaftis (2013) SemEval-2013 task 13: Word sense induction for graded and non-graded senses. In *SEM 2013 - 2nd Joint Conference on Lexical and Computational Semantics, Vol. 2, pp. 290–299. External Links: ISBN 9781937284497 Cited by: §2.
  • D. Loureiro and A. Jorge (2019) Language modelling makes sense: propagating representations through WordNet for full-coverage word sense disambiguation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 5682–5691. Cited by: §1.
  • B. Magnini, C. Strapparava, G. Pezzulo, and A. Gliozzo (2002) The role of domain information in word sense disambiguation. Natural Language Engineering 8 (4), pp. 359–373. Cited by: §4.2.
  • S. Manandhar, I. P. Klapaftis, D. Dligach, and S. S. Pradhan (2010) SemEval-2010 Task 14: Word Sense Induction & Disambiguation. In Proceedings of SemEval, pp. 15–16. Cited by: §2.
  • R. Navigli (2009) Word Sense Disambiguation: A survey. ACM Computing Surveys 41 (2), pp. 1–69. Cited by: §5.
  • T. Pasini and J. Camacho-Collados (2020) A short survey on sense-annotated corpora. In Proceedings of the International Conference on Language Resources and Evaluation, Marseille, France. Cited by: §2.
  • M. T. Pilehvar and J. Camacho-Collados (2019) WiC: the word-in-context dataset for evaluating context-sensitive meaning representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 1267–1273. Cited by: §1, §2, §3.1.1.
  • A. Raganato, J. Camacho-Collados, and R. Navigli (2017) Word sense disambiguation: a unified evaluation framework and empirical comparison. In Proceedings of EACL, Valencia, Spain, pp. 99–110. Cited by: §1.
  • M. Röder, R. Usbeck, and A. C. Ngonga Ngomo (2018) Gerbil - Benchmarking named entity recognition and linking consistently. Semantic Web 9 (5). External Links: ISSN 22104968 Cited by: §2.
  • S. Singh, A. Subramanya, F. Pereira, and A. McCallum (2012) Wikilinks: a large-scale cross-document coreference corpus labeled via links to Wikipedia. Technical report Technical Report UM-CS-2012-015, University of Massachusetts, Amherst. Cited by: §3.1.2.
  • L. Vial, B. Lecouteux, and D. Schwab (2018) UFSAC: Unification of Sense Annotated Corpora and Tools. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), N. Calzolari (Conference chair), K. Choukri, C. Cieri, T. Declerck, S. Goggi, K. Hasida, H. Isahara, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, S. Piperidis, and T. Tokunaga (Eds.), Miyazaki, Japan. External Links: ISBN 979-10-95546-00-9 Cited by: §1.
  • L. Vial, B. Lecouteux, and D. Schwab (2019) Sense vocabulary compression through the semantic knowledge of WordNet for neural word sense disambiguation. In Proceedings of the 10th Global WordNet Conference. Cited by: §1.
  • A. Wang, Y. Pruksachatkun, N. Nangia, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman (2019) SuperGLUE: a stickier benchmark for general-purpose language understanding systems. arXiv preprint arXiv:1905.00537. Cited by: §2, §4.1.1.
  • Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al. (2016) Google’s neural machine translation system: bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144. Cited by: §4.1.1.