WiC = TSV = WSD: On the Equivalence of Three Semantic Tasks

07/29/2021 ∙ by Bradley Hauer, et al. ∙ University of Alberta

The WiC task has attracted considerable attention in the NLP community, as demonstrated by the popularity of the recent MCL-WiC SemEval task. WSD systems and lexical resources have been used for the WiC task, as well as for WiC dataset construction. TSV is another task related to both WiC and WSD. We aim to establish the exact relationship between WiC, TSV, and WSD. We demonstrate that these semantic classification problems can be pairwise reduced to each other, and so they are theoretically equivalent. We analyze the existing WiC datasets to validate this equivalence hypothesis. We conclude that our understanding of semantic tasks can be increased through the application of tools from theoretical computer science. Our findings also suggest that more efficient and simpler methods for one of these tasks could be successfully applied to the other two.




1 Introduction

The word-in-context (WiC) task is to decide whether a word has the same meaning in two different contexts Pilehvar and Camacho-Collados (2019), which is closely related to word sense disambiguation (WSD), the task of identifying the sense of a word in context Navigli (2009). Despite its relevance to semantics research, the exact relationship between WiC and WSD is still an open question. Is one of the tasks easier than the other? The goal of this paper is to shed light on the exact nature of their relationship via theoretical analysis.

Answering this open question is important as the WiC task has already attracted considerable attention from the research community. A number of proposed systems can be applied to both WSD and WiC. The original motivation of Pilehvar and Camacho-Collados (2019) included “evaluating contextualized word and sense representations and WSD.” Raganato et al. (2020) present XL-WiC, an extension of the original WiC dataset to twelve other languages. Breit et al. (2021) propose the target sense verification (TSV) task, which can be viewed as intermediate between WiC and WSD: decide whether a single example of a word in context has a given sense. Most recently, the first SemEval Shared Task on Multilingual and Cross-lingual Word-in-Context Disambiguation (MCL-WiC) reports 48 actively participating teams, most of which focus on the English dataset Martelli et al. (2021).

In this short paper, we formalize the WiC, TSV, and WSD problems using a common notation, and demonstrate that they are theoretically equivalent. Our argument is based on the observation that the soundness of all three tasks hinges on the consistency of judgments on the sameness of word meaning. We show that the tasks are reducible to one another; that is, a solution to one of them can be used to solve the other two. This finding suggests that the development of more resource-efficient, less complex methods for one of these three tasks could yield efficiency improvements in the other two. While we focus on English in this paper, we postulate that our results will generalize to other languages.

In order to empirically validate the equivalence hypothesis, we conduct an analysis of a sample of instances in the MCL-WiC dataset. The results of this manual annotation experiment indicate that our hypothesis holds with high probability. This is a surprising finding because top-performing WiC systems tend to be binary classifiers independent of any predefined sense inventory Martelli et al. (2021).

This paper has the following structure: In Section 2, we present our theoretical argument for the equivalence of the three problems. In Section 3, we provide empirical evidence that the problem reductions work in practice. In Section 4, we conclude with some general observations on lexical semantics in light of our work.

2 The Equivalence Hypothesis

In this section, we formally define the three problems, present a theoretical argument for their equivalence, and specify the reductions.

2.1 Problem Definitions

  • WSD: Given a context $C$ which contains a single target word $w$, return the sense $s$ of $w$ in $C$.

  • TSV: Given a context $C$ which contains a single target word $w$, and a sense $s$, return True if $s$ is the sense of $w$ in $C$, and False otherwise.

  • WiC: Given two contexts $C_1$ and $C_2$ which contain the same target word $w$, return True if $w$ has the same meaning in both $C_1$ and $C_2$, and False otherwise.

2.2 Problem Equivalence

The theoretical argument for the equivalence of the three problems is based on the assumption that the relation of sameness of word meaning is shared between the three problems. This is supported by the lack of distinction between meanings and senses in the original WiC task proposal.[1] On the other hand, WordNet exhibits a strict one-to-one correspondence between distinct meanings, synsets, and concepts, with each word sense corresponding to a specific synset. This implies that senses are ultimately grounded in sameness of meaning as well.[2] Therefore, every word meaning distinction should correspond to a pair of distinct senses. Conversely, if two tokens of the same word express different concepts, their meanings must be different. This equivalence also includes the TSV problem, provided that each sense corresponds to a single WordNet synset.

[1] “The proposed dataset, WiC, is based on lexicographic examples, which constitute a reliable basis to […] discern different meanings of words.” Pilehvar and Camacho-Collados (2019).

[2] “[Each] synonym set represents one underlying lexical concept. […] Word meaning [refers] to the lexicalized concept that a [word] form can be used to express.” Miller (1995).

2.3 Problem Reductions

We now present the three problem reductions.

To reduce WiC to WSD, solve the WSD problem for the target word $w$ in both WiC contexts $C_1$ and $C_2$; return True if both returned senses of the target word are identical, and False otherwise. Formally:

$$\mathrm{WiC}(C_1, C_2, w) \Leftrightarrow \mathrm{WSD}(C_1, w) = \mathrm{WSD}(C_2, w)$$

To reduce WSD to TSV, for each possible sense $s$ of $w$, construct a separate TSV instance that pairs the original context $C$ with the sense $s$; return the sense $s$ for which TSV is True.[3] Formally:

$$\mathrm{WSD}(C, w) = s \Leftrightarrow \mathrm{TSV}(C, w, s)$$

[3] The uniqueness of $s$ follows from the assumption that every content word in a meaningful utterance is used in exactly one sense. This is empirically supported by the fact that 99.7% of annotated tokens in SemCor are assigned a single sense.

To reduce TSV to WiC, construct a WiC instance that pairs the original context $C$ with an example context $C_s$ that contains $w$ used in the sense $s$; return the value of WiC, which is either True or False. Formally:

$$\mathrm{TSV}(C, w, s) \Leftrightarrow \mathrm{WiC}(C, C_s, w)$$
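The three reductions can be sketched in Python as higher-order functions. This is only an illustrative sketch, not an implementation from the paper: the type aliases, the helper arguments `senses_of` and `example_of`, and the toy sense labels for "bank" are all assumptions introduced here. A toy oracle stands in for a real WSD system so that the chain WSD → WiC → TSV → WSD can be followed end to end.

```python
from typing import Callable, Iterable

Sense = str  # hypothetical sense identifier (e.g. a WordNet sense key)
WSD = Callable[[str, str], Sense]        # (context, word) -> sense
TSV = Callable[[str, str, Sense], bool]  # (context, word, sense) -> bool
WiC = Callable[[str, str, str], bool]    # (context1, context2, word) -> bool


def wic_from_wsd(wsd: WSD) -> WiC:
    """WiC via WSD: True iff the word gets the same sense in both contexts."""
    def wic(c1: str, c2: str, w: str) -> bool:
        return wsd(c1, w) == wsd(c2, w)
    return wic


def wsd_from_tsv(tsv: TSV, senses_of: Callable[[str], Iterable[Sense]]) -> WSD:
    """WSD via TSV: try every candidate sense, return the one TSV accepts."""
    def wsd(c: str, w: str) -> Sense:
        for s in senses_of(w):
            if tsv(c, w, s):
                return s
        raise LookupError(f"no sense of {w!r} verified in context")
    return wsd


def tsv_from_wic(wic: WiC, example_of: Callable[[str, Sense], str]) -> TSV:
    """TSV via WiC: compare the context with an example context for the sense."""
    def tsv(c: str, w: str, s: Sense) -> bool:
        return wic(c, example_of(w, s), w)
    return tsv


# Toy data (invented for illustration only).
RIVER_1 = "She sat on the bank of the river."
RIVER_2 = "They fished from the grassy bank."
MONEY_1 = "He deposited cash at the bank."
MONEY_2 = "The bank approved the loan."
GOLD = {RIVER_1: "bank%river", RIVER_2: "bank%river",
        MONEY_1: "bank%finance", MONEY_2: "bank%finance"}
SENSES = {"bank": ["bank%river", "bank%finance"]}
EXAMPLES = {("bank", "bank%river"): RIVER_2,
            ("bank", "bank%finance"): MONEY_2}


def oracle_wsd(c: str, w: str) -> Sense:
    """Stand-in for a real WSD system: looks up the gold sense."""
    return GOLD[c]


# Chain the reductions: WSD -> WiC -> TSV -> WSD.
wic = wic_from_wsd(oracle_wsd)
tsv = tsv_from_wic(wic, lambda w, s: EXAMPLES[(w, s)])
wsd_again = wsd_from_tsv(tsv, lambda w: SENSES[w])
```

In this toy setting, `wsd_again` recovers the same senses as `oracle_wsd`, which is the behaviour the equivalence argument predicts when sameness of meaning is judged consistently across the three tasks.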

3 Datasets

In this section, we discuss and analyze the existing WiC datasets with the aim of validating the equivalence hypothesis. A genuine exception to the hypothesis would be an instance that contradicts the WiC-to-WSD reduction in Section 2.3. Since natural language is not pure logic, falsifying the hypothesis would require finding that true exceptions constitute a substantial fraction of instances, excluding errors and omissions in lexical resources.

3.1 WiC

The original WiC dataset Pilehvar and Camacho-Collados (2019) consists of 7466 pairs of sentences from WordNet Fellbaum (1998), VerbNet Schuler (2005), and Wiktionary, divided into training, development, and test sets. Each word sense in the latter two resources was mapped to a WordNet sense via BabelNet Navigli and Ponzetto (2010, 2012). The authors report that 68% of the initial instances were drawn from WordNet, but the source of each sentence is not indicated in the data. The instances were further filtered to remove fine-grained sense distinctions. The reported inter-annotator agreement was 80% for the final pruned set, and only 57% for the pruned-out instances, which is attributed to the lack of reference sense inventories.

Since, regardless of the source, all instances were annotated automatically by checking the sense identity in WordNet, the WiC dataset cannot, by construction, contain any exceptions to the equivalence hypothesis. Nevertheless, we conducted a partial check by attempting to match each sentence in the dataset to one of the example sentences in WordNet 3.0. The fraction of instances in which both sentences are identical to a WordNet example sentence, allowing us to identify both senses of the target word, was 0.479, 0.534, and 0.522, for the training, development and test sets, respectively.

Based on these findings, we can implement the WiC-to-WSD reduction defined in Section 2 by checking whether both input sentences are among the WordNet usage examples for the target word. Combined with a random back-off, this approach would be expected to obtain an accuracy of approximately 0.74 on the training set (0.479 plus half of the remaining 0.521), and 0.767 and 0.761 on the development and test sets, respectively. We note that this latter figure exceeds the current state-of-the-art result of Levine et al. (2020), which also uses WordNet.
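The lookup-based reduction and its expected accuracy can be illustrated with a short Python sketch. The names `wic_by_lookup`, `expected_accuracy`, and `toy_index` are hypothetical, and the calculation rests on two assumptions that are not spelled out in the text: instances fully covered by WordNet examples are always answered correctly, and the random back-off is right half of the time on the remaining instances.

```python
import random
from typing import Dict, Tuple

# Hypothetical index: (target word, example sentence) -> sense identifier.
ExampleIndex = Dict[Tuple[str, str], str]


def wic_by_lookup(c1: str, c2: str, w: str, index: ExampleIndex,
                  rng: random.Random) -> bool:
    """WiC-to-WSD reduction where WSD is a lookup among known usage examples."""
    s1 = index.get((w, c1))
    s2 = index.get((w, c2))
    if s1 is not None and s2 is not None:
        return s1 == s2          # both senses identified: compare them
    return rng.random() < 0.5    # random back-off when a sentence is unknown


def expected_accuracy(coverage: float, backoff_accuracy: float = 0.5) -> float:
    """Covered instances assumed correct; the rest get the back-off accuracy."""
    return coverage + (1.0 - coverage) * backoff_accuracy


# Coverage figures reported in the text for the three WiC splits.
reported_coverage = {"train": 0.479, "dev": 0.534, "test": 0.522}
expected = {split: expected_accuracy(c) for split, c in reported_coverage.items()}

# Toy index (invented sentences and sense labels, for illustration only).
toy_index: ExampleIndex = {
    ("bank", "They fished from the grassy bank."): "bank%river",
    ("bank", "The bank approved the loan."): "bank%finance",
}
same = wic_by_lookup("They fished from the grassy bank.",
                     "The bank approved the loan.",
                     "bank", toy_index, random.Random(0))
```

Under these assumptions, the development and test coverages of 0.534 and 0.522 give 0.767 and 0.761, matching the figures quoted above, and the training coverage of 0.479 gives 0.7395.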

3.2 WiC-TSV

Breit et al. (2021) propose Target Sense Verification (TSV), a task of deciding whether a given word in context is used in a given sense. TSV is similar to WiC in that it is also a binary classification task, but only one context per instance is provided. TSV is also similar to WSD in that there is an explicit representation of senses, but only one sense per instance is provided. Three TSV sub-tasks are defined depending on the method of defining a sense: (a) definition, (b) hypernyms, and (c) both definition and hypernyms. The definition-based sub-task closely matches the formulation of TSV that we formalized in Section 2.1.

The majority of the WiC-TSV datasets of Breit et al. (2021) are derived directly from the original WiC dataset. The training, development, and one of the four test sets contain 2137, 389, and 717 instances, respectively. The remaining three test sets are devoted to cocktail, medical, and computer terms, and contain 216, 205, and 168 instances, respectively. Since their focus is on named entities, these three test sets are arguably more relevant to named entity resolution than to WSD.

The instances that are derived from the original WiC are ultimately based on WordNet senses. Specifically, at most one sentence from each WiC instance was taken, and the sense of the word was recovered by reversing the process by which the WiC instances were created; that is, by finding the source of that example sentence, and identifying the corresponding WordNet synset (similar to the procedure we applied to the WiC dataset in Section 3.1). Because of this dataset construction method, no exceptions to the equivalence hypothesis can be found in WiC-TSV, just as in WiC.

3.3 MCL-WiC

Martelli et al. (2021) introduce the Multilingual and Cross-lingual Word-in-Context dataset. The English portion of the dataset consists of 10k WiC instances, divided into a training set (8k instances), as well as development and test sets (1k instances each). The monolingual task is exactly the same as the original WiC task, and matches our WiC problem formalization in Section 2.1.

In contrast with the original WiC dataset, which was mostly derived from WordNet, the sentence pairs in MCL-WiC were manually selected and annotated by human annotators. Annotators were instructed to consult “multiple reputable dictionaries” to minimize the subjectivity of the annotation process, and provide support for their decisions. The reported inter-annotator agreement is much higher than on the original WiC dataset. The best-performing shared task system on the English dataset, Gupta et al. (2021), achieved 93.3% accuracy, which is also much higher than the results on the original WiC dataset.

3.3.1 Annotation Experiment

The MCL-WiC dataset provides a unique test for the equivalence of WiC and WSD because its annotation method is independent of WordNet. We analyzed 60 randomly selected instances from the English training set by manually identifying the WordNet sense of each target word. We found that 55 out of 60 instances conform to the equivalence hypothesis. The remaining five instances can be divided into three categories: (1) tokenization errors in MCL-WiC; (2) missing senses in WordNet; (3) possible annotation errors in MCL-WiC. We discuss these three types of errors below.

In two instances, word tokenization errors interfere with the MCL-WiC annotations: (1) together in “the final coming together” is annotated as an adverb instead of a particle of a phrasal verb, and (2) shiner in “shoes shiners met the inspector” is annotated as a stand-alone noun instead of a part of a compound noun. These tokenization errors prevent the proper assignment of WordNet senses.

In two instances, one of the senses of the target word is missing in WordNet: (1) breach referring to an animal breaking through the surface of the water, and (2) spotter referring to a member of a motor racing team who communicates by radio with the driver. Neither of these senses is subsumed by another sense in WordNet, and both of them are present in the Oxford English dictionary,[4] which was consulted by the MCL-WiC annotators.

[4] https://www.lexico.com

In the final problematic instance, MCL-WiC classifies the noun campaign as having the same meaning in the contexts “during the election campaign” and “the campaign had a positive impact on behavior.” The distinction between these two senses of campaign is found in Oxford and WordNet, but not in Merriam-Webster or Wiktionary. Since the exact algorithm for resolving differences between dictionaries is not specified by Martelli et al. (2021), we are unable to decide whether this is an actual exception to the hypothesis, or simply an annotation error in MCL-WiC.

In summary, a careful analysis of the sample of 60 instances reveals no clear exceptions to the equivalence hypothesis. If we conservatively assume that the instance involving the word campaign does represent an actual exception, so that 59 of the 60 instances conform, the binomial confidence interval for the probability of the WiC-to-WSD reduction being successful is at the level of inter-annotator agreement. We therefore conclude that the results of the annotation experiment strongly support the hypothesis.
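As a rough illustration of the interval computation, the sketch below treats the conservative scenario as 59 successes in 60 trials. The interval method and confidence level are not stated in this text, so the normal-approximation (Wald) interval at 95% confidence used here is an assumption, and the function name `wald_interval` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist
from typing import Tuple


def wald_interval(successes: int, n: int,
                  confidence: float = 0.95) -> Tuple[float, float]:
    """Return (point estimate, half-width) of a normal-approximation interval."""
    p = successes / n
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    half_width = z * sqrt(p * (1.0 - p) / n)
    return p, half_width


# Conservative scenario: one genuine exception among the 60 sampled instances.
p_hat, half_width = wald_interval(59, 60)

# A proportion cannot leave [0, 1], so the bounds are clipped.
lower = max(0.0, p_hat - half_width)
upper = min(1.0, p_hat + half_width)
```

With these assumptions the estimate is about 0.983 with a half-width of about 0.032. Because the sample proportion is close to 1, the normal approximation is crude here, and an interval such as Wilson's would be a reasonable alternative.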

4 Discussion

Having presented a theoretical argument for the equivalence of WiC, WSD, and TSV in Section 2, and empirical evidence for the correctness of one of the key reductions in Section 3, we now discuss our inferences, making the case for WordNet as the underlying sense inventory in WiC.

Most English WiC and TSV datasets are based on predefined sense inventories that come either directly or indirectly from WordNet. If no sense inventory is used for grounding decisions about meaning, the inter-annotator agreement is reported to be about 80% Pilehvar and Camacho-Collados (2019); Breit et al. (2021). For the MCL-WiC dataset, however, annotators consulted other dictionaries, and obtained “almost perfect agreement” Martelli et al. (2021). This suggests that sense inventories, and semantic resources in general, are crucial to reliable annotation for semantic tasks. However, because the exact MCL-WiC procedure for resolving differences between dictionaries is unspecified, and because such dictionaries vary in their availability, the correctness of the annotations cannot be verified (cf. Section 3.3).

The adoption of WordNet as the standard sense inventory for WiC has several practical benefits: (1) it has been adopted as the standard inventory for WSD, and so would simplify multi-task evaluation; (2) it allows seamless application of systems across datasets; (3) it facilitates rapid creation of new WiC datasets based on existing sense-annotated corpora; (4) it is freely available; (5) it can be modified to correct errors and omissions; and finally (6) it can be extended to facilitate work with other languages, as in the XL-WiC dataset.

In addition, WordNet has strong theoretical advantages. Its fine granularity is a consequence of the underlying principle of distinguishing lexical concepts. Therefore, the sense distinctions found in dictionaries either already correspond to different concepts, or should result in adding new concepts to WordNet. Furthermore, unlike in dictionaries, sense inventories of different words are inter-connected via semantic relations such as synonymy and hypernymy, which facilitate an objective assignment of every word usage to a single WordNet concept. This has allowed the WSD methods based on contextual and sense embeddings to surpass the inter-annotator agreement ceiling of around 70% Navigli (2006).

5 Conclusion

We proposed a novel hypothesis on the equivalence of three semantic tasks, and demonstrated how they can be reduced to one another. We analyzed the existing datasets to verify that the reductions work in practice. We further corroborated our conclusions by performing an annotation experiment on a dataset constructed independently from WordNet. The hypothesis provides theoretical support for the application of the WiC-to-WSD reduction. This finding is based on the properties of the WordNet sense inventory, which we have argued are highly desirable in semantics research. Future work could include applying techniques from WiC research to WSD, or joint model optimization for both tasks.


  • Breit et al. (2021) Anna Breit, Artem Revenko, Kiamehr Rezaee, Mohammad Taher Pilehvar, and Jose Camacho-Collados. 2021. WiC-TSV: An evaluation benchmark for target sense verification of words in context. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 1635–1645.
  • Fellbaum (1998) Christiane Fellbaum, editor. 1998. WordNet: An Electronic Lexical Database. MIT Press.
  • Gupta et al. (2021) Rohan Gupta, Jay Mundra, Deepak Mahajan, and Ashutosh Modi. 2021. MCL@IITK at SemEval-2021 task 2: Multilingual and cross-lingual word-in-context disambiguation using augmented data, signals, and transformers. In Proceedings of the Fifteenth Workshop on Semantic Evaluation.
  • Levine et al. (2020) Yoav Levine, Barak Lenz, Or Dagan, Ori Ram, Dan Padnos, Or Sharir, Shai Shalev-Shwartz, Amnon Shashua, and Yoav Shoham. 2020. SenseBERT: Driving some sense into BERT. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4656–4667.
  • Martelli et al. (2021) Federico Martelli, Najla Kalach, Gabriele Tola, and Roberto Navigli. 2021. SemEval-2021 task 2: Multilingual and Cross-lingual Word-in-Context Disambiguation (MCL-WiC). In Proceedings of the Fifteenth Workshop on Semantic Evaluation (SemEval-2021).
  • Miller (1995) George A Miller. 1995. WordNet: A lexical database for English. Communications of the ACM, 38(11):39–41.
  • Navigli (2006) Roberto Navigli. 2006. Meaningful clustering of senses helps boost word sense disambiguation performance. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, pages 105–112.
  • Navigli (2009) Roberto Navigli. 2009. Word sense disambiguation: A survey. ACM Computing Surveys (CSUR), 41(2):10.
  • Navigli and Ponzetto (2010) Roberto Navigli and Simone Paolo Ponzetto. 2010. BabelNet: Building a very large multilingual semantic network. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 216–225.
  • Navigli and Ponzetto (2012) Roberto Navigli and Simone Paolo Ponzetto. 2012. BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network. Artificial Intelligence, 193:217–250.
  • Pilehvar and Camacho-Collados (2019) Mohammad Taher Pilehvar and Jose Camacho-Collados. 2019. WiC: the word-in-context dataset for evaluating context-sensitive meaning representations. In Proceedings of NAACL-HLT, pages 1267–1273, Minneapolis, Minnesota, USA. Association for Computational Linguistics.
  • Raganato et al. (2020) Alessandro Raganato, Tommaso Pasini, Jose Camacho-Collados, and Mohammad Taher Pilehvar. 2020. XL-WiC: A multilingual benchmark for evaluating semantic contextualization. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7193–7206.
  • Schuler (2005) Karin Kipper Schuler. 2005. VerbNet: A broad-coverage, comprehensive verb lexicon. University of Pennsylvania.