A Short Survey on Sense-Annotated Corpora for Diverse Languages and Resources

by   Tommaso Pasini, et al.
Sapienza University of Rome

With the advancement of research in word sense disambiguation and deep learning, large sense-annotated datasets are increasingly important for training supervised systems. However, gathering high-quality sense-annotated data for as many instances as possible is an arduous task. This has led to the proliferation of automatic and semi-automatic methods for overcoming the so-called knowledge-acquisition bottleneck. In this paper we present an overview of currently available sense-annotated corpora, both manually and automatically constructed, for various languages and resources (i.e. WordNet, Wikipedia, BabelNet). General statistics and specific features of each sense-annotated dataset are also provided.


1 Introduction

Word Sense Disambiguation (WSD) is a key task in Natural Language Understanding. It consists of assigning the appropriate meaning from a pre-defined sense inventory to a word in context. While knowledge-based approaches to this task have been proposed Agirre et al. (2014); Moro et al. (2014); Camacho-Collados et al. (2016b); Butnaru et al. (2017); Chaplot and Salakhutdinov (2018), supervised approaches Zhong and Ng (2010); Melamud et al. (2016); Iacobacci et al. (2016); Kågebäck and Salomonsson (2016) have been more effective in terms of performance Raganato et al. (2017a) in those languages where sense-annotated datasets are available. Unfortunately, obtaining such data is highly time-consuming and expensive Schubert (2006), and reasonable amounts of sense-annotated data tend to be available for English only. This produces the so-called knowledge-acquisition bottleneck Gale et al. (1992).

The first major effort towards building sense-annotated corpora was SemCor Miller et al. (1993), which provides annotations for the WordNet sense inventory Fellbaum (1998). Since then, several semi-automatic and automatic approaches have also been proposed. These automatic efforts tend to produce noisier annotations, but their coverage has been shown to lead to better supervised and semi-supervised WSD systems Taghipour and Ng (2015b); Otegi et al. (2016); Raganato et al. (2016); Yuan et al. (2016); Delli Bovi et al. (2017); Pasini and Navigli (2017), as well as to help learn effective embedded representations for senses Iacobacci et al. (2015); Flekova and Gurevych (2016).

In this short survey we present the main approaches in the literature for building sense-annotated corpora, not only for WordNet but also for multilingual sense inventories like Wikipedia or BabelNet. Additional works provide sense-annotated data for other resources such as the New Oxford American Dictionary Yuan et al. (2016) or language-specific versions of WordNet like GermaNet Henrich et al. (2012). While work on these language-specific resources is certainly relevant, in this survey we focus on English WordNet and on multilingual resources with higher coverage like Wikipedia and BabelNet. For a more specific survey on corpora annotated with language-specific versions of WordNet, please refer to Petrolito and Bond (2014).

Finally, we provide a general overview and statistics of these sense-annotated resources, providing hints for a better use and interoperability among resources and languages.

2 Sense-Annotated Corpora

Figure 1: Overview of sense inventories with their corresponding sense-annotated corpora.

Several works have attempted to construct sense-annotated datasets to overcome the knowledge-acquisition bottleneck. In this section we describe the main efforts in this direction, whether manual, semi-automatic or fully automatic. In particular, we present currently available sense-annotated corpora for three resources: WordNet (Section 2.1), Wikipedia (Section 2.2) and BabelNet (Section 2.3). Figure 1 presents a general overview of the resources and the sense-annotated corpora for each sense inventory.

2.1 WordNet

WordNet Fellbaum (1998) has been one of the most widely used knowledge resources in lexical semantics. In fact, it has been the de facto sense inventory for Word Sense Disambiguation for many years. The core unit in WordNet is the synset. A synset represents a concept or a meaning which is expressed by its various lexicalizations (i.e. senses). For example, the meaning motor vehicle with four wheels can be expressed by its synonym senses auto, automobile, machine and motorcar.
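The synset structure described above can be illustrated with a minimal sketch. It does not use the real WordNet database: the `Synset` class, the identifiers and the second toy entry are hypothetical and only mirror the example in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Synset:
    """A concept together with the word senses (lexicalizations) that express it."""
    synset_id: str   # hypothetical identifier, not a real WordNet key
    gloss: str       # textual definition of the concept
    lemmas: tuple    # surface forms that can express the concept


# Toy inventory built by hand for illustration only.
INVENTORY = [
    Synset("toy:motor-vehicle", "motor vehicle with four wheels",
           ("auto", "automobile", "machine", "motorcar")),
    # Invented second entry, added so that "machine" becomes ambiguous.
    Synset("toy:device", "a mechanical device (illustrative gloss)",
           ("machine",)),
]


def candidate_synsets(word, inventory=INVENTORY):
    """Return every synset that lists `word` among its lexicalizations."""
    return [s for s in inventory if word in s.lemmas]


print([s.synset_id for s in candidate_synsets("machine")])
# ['toy:motor-vehicle', 'toy:device']
print([s.synset_id for s in candidate_synsets("motorcar")])
# ['toy:motor-vehicle']
```

In this toy setting, a WSD system would have to choose between the two candidates returned for "machine", whereas "motorcar" is unambiguous.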


The first and most prominent example of sense-annotated corpora is SemCor Miller et al. (1993). SemCor was manually annotated and consists of 352 documents and 226,040 sense annotations. SemCor has been the largest manually-annotated corpus for many years, and is the main corpus used in the literature to train supervised WSD systems Agirre et al. (2009); Zhong and Ng (2010); Raganato et al. (2017b).

SemEval evaluation datasets.

SemEval datasets provide reliable benchmarks for testing WSD systems. The main datasets from Senseval and SemEval competitions have been compiled and unified by Raganato et al. (2017a). In particular, the unified collection comprises the datasets from Senseval-2 Edmonds and Cotton (2001), Senseval-3 task 1 Snyder and Palmer (2004), SemEval-2007 task 17 Pradhan et al. (2007), SemEval-2013 task 12 Navigli et al. (2013), and SemEval-2015 task 13 Moro and Navigli (2015). These datasets, which have mainly been used as evaluation benchmarks for WSD systems, contain a total of 7,253 sense annotations.

Princeton WordNet Gloss.

The Princeton WordNet Gloss Corpus is a sense-annotated corpus of textual definitions (glosses) from WordNet synsets. The corpus was tagged semi-automatically: 330,499 sense instances were annotated manually, while the remaining annotations (i.e. 118,856) were obtained automatically. This corpus of disambiguated glosses has already proved useful in tasks like semantic similarity Pilehvar et al. (2013), domain labeling González et al. (2012) and Word Sense Disambiguation Baldwin et al. (2008); Agirre and Soroa (2009); Camacho-Collados et al. (2015).


The task of gathering sense annotations has proved expensive and not easily scalable. That is the reason why more recent approaches have attempted to exploit semi-automatic or automatic techniques. OMSTI (Taghipour and Ng, 2015a, One Million Sense-Tagged Instances), a semi-automatically constructed corpus annotated with WordNet senses, is a prominent example. It was built by applying the alignment-based WSD approach of Chan and Ng (2005) to a large English-Chinese parallel corpus (Eisele and Chen, 2010, MultiUN corpus). OMSTI, coupled with SemCor, has already been successfully leveraged as training data for supervised systems Taghipour and Ng (2015a); Iacobacci et al. (2016); Raganato et al. (2017a). (Even though OMSTI was released along with SemCor, in this survey we refer to the portion of sense-annotated data from the MultiUN corpus only.)

2.2 Wikipedia

Wikipedia is a collaboratively constructed encyclopedic resource consisting of concepts and entities and their corresponding pages. In addition to a large coverage of concepts and entities, Wikipedia provides multilinguality, as it covers over 250 languages and these languages are connected via interlingual links. In this section we describe two datasets providing disambiguations from Wikipedia pages. (Note that more Wikipedia sense-annotated datasets extracted from the Wikilinks project exist Singh et al. (2012); Eshel et al. (2017). However, due to privacy and license issues, these datasets cannot be shared directly. Please also refer to Usbeck et al. (2015) for an overview and unification of datasets focused on Entity Linking.) For these two datasets we have used the same version of Wikipedia, namely the dumps of November 2014, for a more accurate comparison.

Wikipedia hyperlinks.

This corpus contains the whole Wikipedia corpus with hyperlinks as sense-annotated instances. Hyperlinks are highlighted mentions within a Wikipedia article that link directly to another Wikipedia page. These links are provided by Wikipedia editors. 271 languages, with different degrees of coverage, were already available in the Wikipedia dump of November 2014.
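To illustrate how hyperlinks can be read as sense annotations, the following minimal sketch extracts (mention, target page) pairs from raw wikitext, where internal links are written as `[[Target]]` or `[[Target|anchor text]]`. The function name and the sample sentence are assumptions made for illustration; this is not the pipeline used to build the corpus, and a real extractor would also need to filter out file, category and interlanguage links.

```python
import re

# Matches [[Target]], [[Target|anchor]] and [[Target#Section|anchor]].
# Group 1 is the target page title, group 2 the optional anchor text.
WIKILINK = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|([^\[\]]+))?\]\]")


def extract_link_annotations(wikitext):
    """Return (mention, target_page) pairs, one per internal link."""
    annotations = []
    for match in WIKILINK.finditer(wikitext):
        target = match.group(1).strip()
        # Without explicit anchor text, the mention is the target title itself.
        mention = (match.group(2) or target).strip()
        annotations.append((mention, target))
    return annotations


sample = "The [[Jaguar (car)|jaguar]] was parked near the [[zoo]]."
print(extract_link_annotations(sample))
# [('jaguar', 'Jaguar (car)'), ('zoo', 'zoo')]
```

Each pair treats the linked page as the sense of the mention, which is why the annotation quality depends directly on the links provided by Wikipedia editors.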


The Semantically Enriched Wikipedia (Raganato et al., 2016, SEW) is a corpus based on Wikipedia in which the original links have been exploited in order to annotate other content words in the same Wikipedia page with a sense from BabelNet. Its English version comprises over 160M sense annotations with an estimated precision over 90%.

2.3 BabelNet

BabelNet Navigli and Ponzetto (2012) is a wide-coverage multilingual semantic network obtained from the integration of various encyclopedias and dictionaries (inter alia WordNet and Wikipedia). Being a superset of all these resources, BabelNet brings together lexicographic and encyclopedic knowledge, thus containing named entities and concepts from over 250 languages.


SenseDefs Camacho-Collados et al. (2016a) extends the effort from the Princeton WordNet Gloss Corpus project (see Section 2.1) by automatically disambiguating textual definitions from various heterogeneous sources in 263 languages. The underlying idea lies in the exploitation of the cross-complementarities of definitions of identical concepts from different languages and resources. The approach couples a graph-based disambiguation method Moro et al. (2014) with a refinement based on distributional similarity Camacho-Collados et al. (2016b). The proposed method was evaluated on four European languages (English, Spanish, French and Italian) with an estimated precision of over 80%.

Corpus           Resource   Type            #Langs  #Annotations  #Tokens (EN)   #Annot (EN)  Amb (EN)

SemCor           WordNet    Manual          1       226,036       802,443        226,036      6.8
SemEval-ALL      WordNet    Manual          1       7,253         25,503         7,253        5.8
Princeton Gloss  WordNet    Semi-automatic  1       449,355       1,621,129      449,355      3.8
OMSTI            WordNet    Semi-automatic  1       911,134       30,441,386     911,134      8.9

Wikipedia        Wikipedia  Collaborative   271     321,718,966   1,357,105,761  71,457,658   2.6
SEW              Wikipedia  Semi-automatic  1       162,614,753   1,357,105,761  162,614,753  7.9

SenseDefs        BabelNet   Automatic       263     163,029,131   71,109,002     37,941,345   4.6
EuroSense        BabelNet   Automatic       21      122,963,111   48,274,313     15,502,847   6.5
T-o-M            BabelNet   Automatic       6       17,987,488    291,550,966    12,722,530   1.1

Table 1: Statistics of the sense-annotated corpora.


The construction of EuroSense Delli Bovi et al. (2017) follows a similar approach to SenseDefs. In this case, parallel corpora are exploited for a single multilingual disambiguation. The output is a sense-annotated corpus in 21 languages for the Europarl parallel corpus Koehn (2005). The estimated precision for four languages is over 80% on average, with a peak of almost 90% for German.


Similarly to the previous approach, Train-o-Matic (Pasini and Navigli, 2017, T-o-M) aims at automatically annotating words from a raw corpus with senses. The main difference with respect to EuroSense and OMSTI lies in the fact that T-o-M does not need parallel data in order to annotate the input corpus. Being language-independent and fully automatic, it has been shown to lead supervised systems to high performance Pasini et al. (2018), close to or even better than that achieved when a manually annotated corpus is used for training (e.g. SemCor Miller et al. (1993)). Moreover, it has also proved effective in languages other than English (i.e. Italian, Spanish, French, German and Chinese), as shown by the results of IMS Zhong and Ng (2010) trained on data produced by T-o-M on recent evaluation datasets.

3 Statistics

In order to have a global overview of each of the corpora, we display the main features of each of the sense-annotated resources, including lexical and language coverage, in Table 1. For each dataset we display its underlying resource, number of languages covered and total number of sense annotations. In addition to these global statistics, Table 1 shows local statistics (i.e. number of tokens, number of sense annotations and ambiguity level) for English, which is the only language covered by all corpora. The ambiguity level of each dataset is computed as the average number of candidate senses per instance (i.e., senses with the same surface form of a target word).
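The ambiguity level defined above can be sketched as follows. The function name, the toy inventory and the sense identifiers are hypothetical; the sketch only assumes that each annotated instance carries its surface form and that the inventory maps a surface form to its candidate senses.

```python
def ambiguity_level(instances, inventory):
    """Average number of candidate senses per annotated instance.

    instances: list of (surface_form, gold_sense) pairs
    inventory: dict mapping a surface form to its list of candidate senses
    """
    if not instances:
        raise ValueError("at least one annotated instance is required")
    total_candidates = sum(len(inventory[form]) for form, _ in instances)
    return total_candidates / len(instances)


# Toy sense inventory and annotations (made up for illustration).
toy_inventory = {
    "bank": ["bank#1", "bank#2", "bank#3"],
    "plant": ["plant#1", "plant#2"],
    "oxygen": ["oxygen#1"],
}
toy_instances = [
    ("bank", "bank#1"),
    ("bank", "bank#2"),
    ("plant", "plant#1"),
    ("oxygen", "oxygen#1"),
]

print(ambiguity_level(toy_instances, toy_inventory))  # (3 + 3 + 2 + 1) / 4 = 2.25
```

Note that the average is taken over instances rather than over distinct words, so frequent ambiguous words weigh more heavily on the final figure.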

As can be seen, the datasets are quite heterogeneous in nature, based on three different resources and constructed via four different strategies: manual, semi-automatic, automatic and collaborative. The number of sense annotations also varies depending on the resource, with Wikipedia- and BabelNet-based corpora contributing the highest number of annotations. This is consistent with the coverage of these resources: Wikipedia and BabelNet are two orders of magnitude larger than WordNet.

4 Conclusion

In this paper we have given an overview of available sense-annotated datasets for WordNet, Wikipedia and BabelNet, and for various languages. These datasets correspond to a wide variety of approaches, from manual construction to automatic or semi-automatic methods. By providing the list and statistics we are pursuing two main goals: (1) motivating and providing information about sense-annotated corpora to be used for research purposes, and (2) highlighting benefits and potential weaknesses of the various sense-annotated corpora, either manually or automatically constructed.

Moreover, this paper represents a first step towards obtaining a fully-integrated repository of sense-annotated corpora which can be easily leveraged for research and evaluation purposes. Our goal is to integrate in the future a large number of resources shared in a unified multilingual repository, following the lines of Raganato et al. (2017a) for WordNet sense-annotated corpora in English.


Jose Camacho-Collados is supported by a Google PhD Fellowship in Natural Language Processing.

References


  • Agirre et al. (2009) Eneko Agirre, Oier Lopez De Lacalle, Christiane Fellbaum, Andrea Marchetti, Antonio Toral, and Piek Vossen. 2009. Semeval-2010 task 17: All-words word sense disambiguation on a specific domain. In Proceedings of the Workshop on Semantic Evaluations: Recent Achievements and Future Directions. Association for Computational Linguistics, pages 123–128.
  • Agirre et al. (2014) Eneko Agirre, Oier Lopez de Lacalle, and Aitor Soroa. 2014. Random walks for knowledge-based word sense disambiguation. Computational Linguistics 40(1):57–84.
  • Agirre and Soroa (2009) Eneko Agirre and Aitor Soroa. 2009. Personalizing PageRank for Word Sense Disambiguation. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, Athens, Greece, 30 March–3 April 2009. pages 33–41.
  • Baldwin et al. (2008) Timothy Baldwin, Su Nam Kim, Francis Bond, Sanae Fujita, David Martinez, and Takaaki Tanaka. 2008. MRD-based word sense disambiguation: Further extending Lesk. In Proceedings of International Joint Conference on Natural Language Processing. pages 775–780.
  • Butnaru et al. (2017) Andrei Butnaru, Radu Tudor Ionescu, and Florentina Hristea. 2017. ShotgunWSD: An unsupervised algorithm for global word sense disambiguation inspired by DNA sequencing. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers. Association for Computational Linguistics, Valencia, Spain, pages 916–926.
  • Camacho-Collados et al. (2016a) José Camacho-Collados, Claudio Delli Bovi, Alessandro Raganato, and Roberto Navigli. 2016a. A Large-Scale Multilingual Disambiguation of Glosses. In Proceedings of LREC. Portoroz, Slovenia, pages 1701–1708.
  • Camacho-Collados et al. (2015) José Camacho-Collados, Mohammad Taher Pilehvar, and Roberto Navigli. 2015. A unified multilingual semantic representation of concepts. In Proceedings of ACL. Beijing, China, pages 741–751.
  • Camacho-Collados et al. (2016b) José Camacho-Collados, Mohammad Taher Pilehvar, and Roberto Navigli. 2016b. Nasari: Integrating explicit knowledge and corpus statistics for a multilingual representation of concepts and entities. Artificial Intelligence 240:36–64.
  • Chan and Ng (2005) Yee Seng Chan and Hwee Tou Ng. 2005. Scaling up word sense disambiguation via parallel texts. In AAAI. volume 5, pages 1037–1042.
  • Chaplot and Salakhutdinov (2018) Devendra Singh Chaplot and Ruslan Salakhutdinov. 2018. Knowledge-based word sense disambiguation using topic models. In Proceedings of AAAI.
  • Delli Bovi et al. (2017) Claudio Delli Bovi, Jose Camacho-Collados, Alessandro Raganato, and Roberto Navigli. 2017. EuroSense: Automatic harvesting of multilingual sense annotations from parallel text. In Proc. of ACL. volume 2, pages 594–600.
  • Edmonds and Cotton (2001) Philip Edmonds and Scott Cotton. 2001. Senseval-2: overview. In Proc. of SensEval 2. ACL, pages 1–5.
  • Eisele and Chen (2010) Andreas Eisele and Yu Chen. 2010. MultiUN: A Multilingual Corpus from United Nation Documents. In Proceedings of the Seventh conference on International Language Resources and Evaluation. pages 2868–2872.
  • Eshel et al. (2017) Yotam Eshel, Noam Cohen, Kira Radinsky, Shaul Markovitch, Ikuya Yamada, and Omer Levy. 2017. Named entity disambiguation for noisy text. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017). Association for Computational Linguistics, pages 58–68.
  • Fellbaum (1998) Christiane Fellbaum. 1998. WordNet: An Electronic Lexical Database. MIT Press, Cambridge, MA.
  • Flekova and Gurevych (2016) Lucie Flekova and Iryna Gurevych. 2016. Supersense Embeddings: A Unified Model for Supersense Interpretation, Prediction and Utilization. In Proc. of ACL. pages 2029–2041.
  • Gale et al. (1992) William A. Gale, Kenneth Church, and David Yarowsky. 1992. A method for disambiguating word senses in a corpus. Computers and the Humanities 26:415–439.
  • González et al. (2012) Aitor González, German Rigau, and Mauro Castillo. 2012. A graph-based method to improve Wordnet domains. In Proceedings of 13th International Conference on Intelligent Text Processing and Computational Linguistics (CICLING). New Delhi, India, pages 17–28.
  • Henrich et al. (2012) Verena Henrich, Erhard Hinrichs, and Tatiana Vodolazova. 2012. Webcage: a web-harvested corpus annotated with germanet senses. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, pages 387–396.
  • Iacobacci et al. (2015) Ignacio Iacobacci, Mohammad Taher Pilehvar, and Roberto Navigli. 2015. SensEmbed: Learning Sense Embeddings for Word and Relational Similarity. In Proc. of ACL. pages 95–105.
  • Iacobacci et al. (2016) Ignacio Iacobacci, Mohammad Taher Pilehvar, and Roberto Navigli. 2016. Embeddings for word sense disambiguation: An evaluation study. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. pages 897–907.
  • Kågebäck and Salomonsson (2016) Mikael Kågebäck and Hans Salomonsson. 2016. Word Sense Disambiguation using a Bidirectional LSTM. In Proc. of CogALex. pages 51–56.
  • Koehn (2005) Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In Proceedings of Machine Translation Summit X.
  • Melamud et al. (2016) Oren Melamud, Jacob Goldberger, and Ido Dagan. 2016. context2vec: Learning generic context embedding with bidirectional lstm. In Proceedings of CONLL. pages 51–61.
  • Miller et al. (1993) George A. Miller, Claudia Leacock, Randee Tengi, and Ross Bunker. 1993. A semantic concordance. In Proceedings of the 3rd DARPA Workshop on Human Language Technology. Plainsboro, N.J., pages 303–308.
  • Moro and Navigli (2015) Andrea Moro and Roberto Navigli. 2015. Semeval-2015 task 13: Multilingual all-words sense disambiguation and entity linking. In Proc. of SemEval-2015.
  • Moro et al. (2014) Andrea Moro, Alessandro Raganato, and Roberto Navigli. 2014. Entity Linking meets Word Sense Disambiguation: a Unified Approach. Transaction of ACL (TACL) 2:231–244.
  • Navigli et al. (2013) Roberto Navigli, David Jurgens, and Daniele Vannella. 2013. Semeval-2013 task 12: Multilingual word sense disambiguation. In Proc. of SemEval 2013. Atlanta, USA, pages 222–231.
  • Navigli and Ponzetto (2012) Roberto Navigli and Simone Paolo Ponzetto. 2012. BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network. Artificial Intelligence 193:217–250.
  • Otegi et al. (2016) Arantxa Otegi, Nora Aranberri, Antonio Branco, Jan Hajic, Steven Neale, Petya Osenova, Rita Pereira, Martin Popel, Joao Silva, Kiril Simov, and Eneko Agirre. 2016. QTLeap WSD/NED Corpora: Semantic Annotation of Parallel Corpora in Six Languages. In Proc. of LREC. pages 3023–3030.
  • Pasini et al. (2018) Tommaso Pasini, Francesco Maria Elia, and Roberto Navigli. 2018. Huge Automatically Extracted Training Sets for Multilingual Word Sense Disambiguation. In Proceedings of LREC. Miyazaki, Japan.
  • Pasini and Navigli (2017) Tommaso Pasini and Roberto Navigli. 2017. Train-o-Matic: Large-scale supervised word sense disambiguation in multiple languages without manual training data. In Proceedings of Empirical Methods in Natural Language Processing. Copenhagen, Denmark.
  • Petrolito and Bond (2014) Tommaso Petrolito and Francis Bond. 2014. A survey of wordnet annotated corpora. In Proceedings Global WordNet Conference, GWC-2014. pages 236–245.
  • Pilehvar et al. (2013) Mohammad Taher Pilehvar, David Jurgens, and Roberto Navigli. 2013. Align, disambiguate and walk: A unified approach for measuring semantic similarity. In Proc. of ACL. pages 1341–1351.
  • Pradhan et al. (2007) Sameer S Pradhan, Edward Loper, Dmitriy Dligach, and Martha Palmer. 2007. Semeval-2007 task 17: English lexical sample, srl and all words. In Proceedings of the 4th International Workshop on Semantic Evaluations. pages 87–92.
  • Raganato et al. (2017a) Alessandro Raganato, Jose Camacho-Collados, and Roberto Navigli. 2017a. Word Sense Disambiguation: A Unified Evaluation Framework and Empirical Comparison. In Proc. of EACL. Valencia, Spain, pages 99–110.
  • Raganato et al. (2016) Alessandro Raganato, Claudio Delli Bovi, and Roberto Navigli. 2016. Automatic Construction and Evaluation of a Large Semantically Enriched Wikipedia. In Proceedings of IJCAI. New York City, NY, USA, pages 2894–2900.
  • Raganato et al. (2017b) Alessandro Raganato, Claudio Delli Bovi, and Roberto Navigli. 2017b. Neural sequence learning models for word sense disambiguation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, pages 1167–1178.
  • Schubert (2006) Lenhart Schubert. 2006. Turing’s Dream and the Knowledge Challenge. In Proc. of AAAI. pages 1534–1538.
  • Singh et al. (2012) Sameer Singh, Amarnag Subramanya, Fernando Pereira, and Andrew McCallum. 2012. Wikilinks: A large-scale cross-document coreference corpus labeled via links to Wikipedia. University of Massachusetts, Amherst, Technical Report UM-CS-2012-015.
  • Snyder and Palmer (2004) Benjamin Snyder and Martha Palmer. 2004. The english all-words task. In Proc. of Senseval-3. Barcelona, Spain, pages 41–43.
  • Taghipour and Ng (2015a) Kaveh Taghipour and Hwee Tou Ng. 2015a. One million sense-tagged instances for word sense disambiguation and induction. CoNLL 2015 pages 338–344.
  • Taghipour and Ng (2015b) Kaveh Taghipour and Hwee Tou Ng. 2015b. Semi-Supervised Word Sense Disambiguation Using Word Embeddings in General and Specific Domains. Proc. of NAACL-HLT pages 314–323.
  • Usbeck et al. (2015) Ricardo Usbeck, Michael Röder, Axel-Cyrille Ngonga Ngomo, Ciro Baron, Andreas Both, Martin Brümmer, Diego Ceccarelli, Marco Cornolti, Didier Cherix, Bernd Eickmann, et al. 2015. Gerbil: general entity annotator benchmarking framework. In Proceedings of the 24th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, pages 1133–1143.
  • Yuan et al. (2016) Dayu Yuan, Julian Richardson, Ryan Doherty, Colin Evans, and Eric Altendorf. 2016. Semi-supervised word sense disambiguation with neural models. Proceedings of COLING pages 1374–1385.
  • Zhong and Ng (2010) Zhi Zhong and Hwee Tou Ng. 2010. It makes sense: A wide-coverage word sense disambiguation system for free text. In Proc. of the ACL. ACL, Uppsala, Sweden, pages 78–83.