The central idea of Semantic Web is to make the content of the Internet pages machine-interpretable. For that, web pages should be linked to ontologies [berners2001semantic, gomez2002ontology] — databases which contain the information on classes of objects, their properties, and relations between the classes. The relations between objects are particularly important in an ontology, because they form its structure. They can be of different types corresponding to different types of relationships between real-world objects. One of the most important relationships is the class-subclass relation. It allows organising entities into a taxonomy — a tree structure where entities are represented as nodes and the edges between them denote subclass-of or instance-of relationship. The class-subclass relations and taxonomies built from them are crucial for understanding the place and the purpose of an object or a concept in the world. This relation and the hierarchical structure created by it are also a basis of many knowledge bases and knowledge graphs — a particular type of knowledge bases where objects are organised in a graph structure. There, the nodes of a graph are objects, and the edges of a graph are relations between the objects.
To support the use of specific knowledge base structures, such as thesauri or taxonomies, there exist specifications and standards classification schemes. For instance, Simple Knowledge Organization Systems (SKOS)111https://www.w3.org/2004/02/skos/intro define several types of lexical-semantic relations in the terms of semantic web, e.g. “has broader/narrower”, “has exact match”, and “is in mapping relation with”. SKOS are used to create structures like thesauri.
However, SKOS does not fully comply with our taxonomy. The class-subclass relation which is the base of taxonomies is a relation between objects X and Y such that the sentence “X is a kind of Y” is acceptable for native speakers. On the other hand, the closest SKOS analogue of taxonomic class-subclass relation is the “broader term relation”. It is different from the class-subclass relation, because it is less specific. For example, it can include the part-whole relations.
The usefulness of an ontology or a knowledge base depends largely on its completeness and its ability to fully reflect the real world. However, since the world is changing, the ontologies need to be constantly updated to stay relevant. There currently exist comprehensive knowledge bases such as Freebase, DBPedia or Wikidata as well as ontologies for specific domains. Many areas of knowledge require their own knowledge bases, and all of them need to be maintained and extended. This is an expensive and time-consuming process which can only be conducted by an expert who is proficient in the discipline and understands the structure of a knowledge base. Thus, in order to speed up and simplify this task, it becomes more and more important to develop systems that could automatically enrich the existing knowledge bases with new words or at least facilitate the manual extension process. The task of automatically or semi-automatically adding new entities to hierarchical structures is referred to as taxonomy enrichment.
In this work, we aim at reviewing the existing taxonomy enrichment models and propose new methods which address their drawback. We also aim at evaluating the scalability of different methods to new languages and datasets.
The state-of-the-art taxonomy enrichment methods have two main drawbacks. First of all, they often use unrealistic formulations of the task. For example, SemEval-2016 task 14 [jurgens-pilehvar-2016-semeval] which was the first effort to evaluate this task in a controlled environment, provided definitions of the query words (words to be added to a taxonomy). This is very informative resource, so the majority of the presented methods heavily depended on those definitions [tanev-rotondi-2016-deftor, espinosa-anke-etal-2016-taln]. However, in the real-world scenarios, such information is usually unavailable, which makes the developed methods inapplicable. We tackle this problem by testing our new methods and the state-of-the-art methods in a realistic setting.
Another gap in the existing research is that the majority of methods use the information from only one source. Namely, some researchers use the information from distributional word embeddings, whereas others consider graph-based models which represent a word based on its position in a taxonomy. Our intuition is that the information from these two sources is complementary, so combining them can improve the performance of taxonomy enrichment models. Therefore, we propose a number of ways to incorporate various sources of information.
First, we propose the new DWRank method which uses only distributional information from pre-trained word embeddings and is similar to other existing methods. We then enable this method to incorporate the different sources of graph information. We compare the various ways of getting the information from a knowledge graph. Finally, another modification of our method successfully combines the information from different sources, beating the current state of the art.
To place our models in the context of the research on taxonomy enrichment, we compare them with a number of state-of-the-art models. To the best of our knowledge, this is the first large-scale evaluation of taxonomy enrichment methods. We are also the first to evaluate the methods on datasets of different sizes and in different languages.
This work is an extended version of the work described in [nikishina-etal-2020-studying, nikishina-etal-2021-exploring, tikhomirov2021meta]. The novelty of this particular article as compared to the previous publications is as follows:
We present a new taxonomy enrichment method DWRank which combines distributional information and the information extracted from Wiktionary.
We present an extension of DWRank called DWRank-Graph which uses various graph-based representations via a common interface.
We present DWRank-Meta — an extension of DWRank which combines the information from different sources and beats the state-of-the-art models.
We present WBSR — a method for taxonomy extension which leverages the information from the Web. This approach is the current state of the art in the task.
We conduct a large-scale computational study of various approaches to taxonomy enrichment, which features multiple methods (including ours as well as state-of-the-art approaches), multiple datasets and languages.
We present datasets for studying the evolution of wordnets for English and Russian, extending the monolingual setup of the RUSSE’2020 shared task [nikishina2020taxonomy] with a larger Russian dataset and similar English versions.
We explore the benefits of meta-embeddings (combinations of embeddings) and graph embeddings for the task of taxonomy enrichment.
We provide an in-depth error analysis of different types of mistakes of the state-of-the art models.
We provide mappings to WordNet Linked Open Data (LOD) Inter-Lingual Index (ILI) from Russian to English synsets. A dataset of multilingual hypernyms opens possibilities for various use-cases for cross-lingual operations on taxonomies.
We release all our code and datasets to enable further research in this direction.222https://github.com/skoltech-nlp/diachronic-wordnets
The rest of the paper is organised as follows. In Section 2 we formulate the task and provide the relevant definitions. Section 3 is devoted to the existing work on taxonomy enrichment. Then, in Section 4 we present our datasets and describe their creation process and features. Sections 5, 6, and 7 contain the descriptions of our methods: in Section 5 we introduce our core method DWRank and in Sections 6 and 7 describe its extensions DWRank-Graph and DWRank-Meta. We report the performance of our models and compare them with the other approaches in Section 8. Finally, in Section 9 we analyse the potential limitations of our methods.
2 Task Formulation and Definitions
In this subsection we formulate the task and explain the main concepts and task-related terms.
2.1 Task Formulation
Let us consider an example of taxonomy. Figure 1 demonstrates a subgraph for the word “Papuan” retrieved from WordNet. There, each concept has one or more parents (concepts which it is derived from). The parents in their turn have their own parents, and one can trace the word affiliation until the most abstract root concepts of the taxonomy. Here, the word “Papuan” is attached to the synsets “indonesian” and “natural_language” as it can mean both an ethnicity and a language. Note that words can have multiple parents for different reasons. Two or more parents can point to the fact that a word has multiple meanings (as in the case with “Papuan”). Alternatively, a word meaning can be a combination of meanings of two higher-level concepts.
Therefore, in order to add a word to a taxonomy, we need to find its hypernym among the entities (synsets in case of wordnets) of this taxonomy. Here, we refer to a word absent from the taxonomy (a word which we would like to add) as a query word. Our task is to attach query words to an existing taxonomic tree.
The task of finding a single suitable hypernym synset is difficult for a machine, because the number of nodes in the existing taxonomy can be very large (e.g. in WordNet the number of noun synsets is around 29,000). Thus, a model trained to solve this task will inevitably return many false answers if asked to provide only one synset candidate. Moreover, as we can see from Figure 1, a word may have multiple hypernyms.
Thus, in the majority of works on taxonomy enrichment the requirement of providing a single correct answer is relaxed. Instead of that, a common approach is to provide
(typically 10 to 20) most suitable candidates. This list is more likely to contain correct synsets. This setting is also consistent with the manual computer-assisted annotation. Presenting an annotator with a small list of candidates will facilitate the taxonomy extension process: annotators will only have to look through a short list of high-probability hypernym candidates instead of searching hypernyms in a list of all synsets of the taxonomy. Thus, we formulate the task oftaxonomy extension as a soft ranking problem, where the synsets are ranked according to their suitability for a given word.
This is an established approach to this task, the same formulation was also used in research works on taxonomy enrichment and in taxonomy enrichment shared tasks [jurgens-pilehvar-2016-semeval, nikishina2020taxonomy].
In this section, we provide several key definitions crucial for understanding of our work.
is a data structure used to store knowledge which comprise of concepts and relations (known as TBox-es) and individuals, concept, and relation instantiations (also known as ABox-es) [TAboxes].
is a knowledge base that uses a graph-structured data model to represent data. Formally, a knowledge graph is a set of triplets where is the entity set and is the relation set. Normally, knowledge graphs comprise multiple types of relations .
is a set of words or phrases corresponding to a concept or a knowledge base entity in the set . Synset is the major element which denotes a node in a knowledge graph , therefore, in our case equals .
is a relation . According to [miller1998wordnet], hypernymy relation exists between objects X and Y if native speakers accept sentences constructed using such patterns as “An X is a (kind of) Y”. Hypernymy is transitive and asymmetrical. Such relations are central organizing relations for nouns and verbs in WordNet [miller1998wordnet]. In fact, hypernymy relations comprise class (or subsumption) relations and instantiation relations considered in ontology studies and description logics [mcguinness1995explaining].
is a special case of a knowledge graph . It is a tree-based structure where nodes from the set are connected with the hypernymy relation . Thus, the set of relations . Elements included in the set are words or concepts. In a taxonomy they are arranged in a hierarchical structure from the most abstract concept (root) to the most specific concepts (leaves).
Query words (new words)
— words to be added to the taxonomy , usually manually collected by experts. In this paper, we use the following notation for this set . Note that words may be ambiguous: one can correspond to more than one synset .
Knowledge graph completion
is a task of completing a Knowledge Graph by finding a set of missing triples .
can be considered as a special case of the knowledge base completion task. This task aimed at associating each new word , which is not yet included in the taxonomy , with the appropriate hypernyms from it.
3 Related Work
There exist numerous approaches to knowledge base completion, which make use of various neural network architectures described in review papers[kbcompletion, paulheim2017knowledge]. Most of them apply low-dimensional graph embeddings [transe, proje]
, deep learning architectures like autoencoders[autoencoder] or graph convolutional networks [li2019ontology, dettmers2018convolutional, shang2019end]
. Another group of approaches makes use of tensor decomposition approaches: Tucker decomposition[balazevic-etal-2019-tucker] and Canonical Polyadic decomposition (CANDECOMP/PARAFAC) [pmlr-v80-lacroix18a, Lacroix20]. Another task, which is closely related to the taxonomy creation is the Knowledge base construction. There exists multiple approaches solving the problem: Text2Onto [text2onto] a pillar language-independent approach which applies user interaction or fully automated methods [DESSI2021253, klink2] which apply text-mining tools and external sources, which are mostly applied for the scholarly domain. However, knowledge base completion task assumes a generic graph while in the taxonomy enrichment task deals with tree structures and specific methods of tree processing are commonly used in this field. In this article we focus on this specific task and narrow down our scope to enrichment and population of taxonomic structures. It should be noted however that the majority of the ontologies and knowledge bases possess some kind of taxonomic backbone and therefore the task of construction and maintaining such a semantic structure is fundamental.
The existing studies on the taxonomies can be divided into three groups. The first one addresses the Hypernym Discovery problem [camacho-collados-etal-2018-semeval]: given a word and a text corpus, the task is to identify hypernyms in the text. However, in this task the participants are not given any predefined taxonomy to rely on. The second group of works deals with the Taxonomy Induction problem [bordea-etal-2015-semeval, bordea-etal-2016-semeval, velardi-etal-2013-ontolearn], in other words, creation of a taxonomy from scratch. Finally, the third direction of research is the Taxonomy Enrichment task: the participants extend a given taxonomy with new words. In this article, we focus on the taxonomy enrichment task and explore various approaches based on text and graph embeddings and their combinations to the solution of this problem. The following sections provide overview of the various strains of research related to the given problem.
3.1 Prior Art on Taxonomy Enrichment
Until recently, the only dataset for the taxonomy enrichment task was created under the scope of SemEval-2016. It contained definitions for the new words, so the majority of models solving this task used the definitions. For instance, Deftor team [tanev-rotondi-2016-deftor]
computed definition vector for the input word, comparing it with the vector of the candidate definitions from WordNet using cosine similarity. Another example isTALN team [espinosa-anke-etal-2016-taln] which also makes use of the definition by extracting noun and verb phrases for candidates generation.
This scenario may be unrealistic for manual annotation because annotators are writing a definition for a new word and adding new words to the taxonomy simultaneously. Having a list of candidates would not only speed up the annotation process but also identify the range of possible senses. Moreover, it is possible that not yet included words may have no definition in any other sources: they could be very rare (“apparatchik”, “falanga”), relatively new (“selfie”, “hashtag”) or come from a narrow domain (“vermiculite”).
Thus, following RUSSE-2020 shared task [nikishina2020taxonomy], we stick to a more realistic scenario when we have no definitions of new words, but only examples of their usage. For this shared task we provide a baseline as well as training and evaluation datasets based on RuWordNet [loukachevitch2016creating] which will be discussed in the next section. The task exploited words which were recently added to the latest release of RuWordNet and for which the hypernym synsets for the words were already identified by qualified annotators. The participants of the competition were asked to find synsets which could be used as hypernyms.
The participated systems mainly relied on vector representations of words and the intuition that words used in similar contexts have close meanings. They cast the task as a classification problem where words need to be assigned one or more hypernyms [kunilovskaya2020russe] or ranked all hypernyms by suitability for a particular word [dale2020russe]. They also used a range of additional resources, such as Wiktionary, dictionaries, additional corpora [arefyev2020russe]. Interestingly, only one of the well-performing models [tikhomirov2020russe] used context-informed embeddings (BERT) or external tools such as online Machine Translation (MT) and search engines (the best-performing model denoted as Yuriy in the workshop description paper).
In this paper, we would like to work out methods which depend on graph based structures and combine them with the approaches applying word embeddings. At the same time, we want our methods to benefit from the existing data (e.g. corpora, pre-trained embeddings, Wiktionary).
3.2 Word Vector Representations for Taxonomies
Approaches using word vector representations are the most popular choice for all tasks related to taxonomies. When solving the Hypernym Discovery problem in SemEval-2018 Task 9 [camacho-collados-etal-2018-semeval] most of participants use word embeddings. For instance, Bernier-Colborne and Barriere [bernier-colborne-barriere-2018-crim] predict the likelihood of the relationship between an input word and a candidate using word2vec embeddings. Berend et al. [berend-etal-2018-300]maldonado-klubicka-2018-adapt] simply consider top-10 closest associates from the Skip-gram word2vec model as hypernym candidates. Pre-trained GloVe embeddings [pennington-etal-2014-glove] are also used to initialize embeddings for an LSTM-based Hypernymy Detection model [shwartz-etal-2016-improving].
Participants also solve the SemEval-2016 Task 13 on taxonomy induction with word embeddings [pocostales-2016-nuig]: they compute the vector offset as the average offset of all the pairs generated and exploit it to predict hypernyms for the new data. Afterwards, in [aly-etal-2019-every] the authors apply word2vec embeddings similarity to improve the approaches of the SemEval-2016 Task 13 participants.
The vast majority of participants of SemEval-2016 task 14 [jurgens-pilehvar-2016-semeval] and RUSSE’2020 [nikishina2020taxonomy] also apply word embeddings to find the correct hypernyms in the existing taxonomy. For instance, the participants compute a definition vector for the input word by comparing it with the definition vectors of the candidates from the wordnet using cosine similarity [tanev-rotondi-2016-deftor]. Another option is to train word2vec embeddings from scratch and cast the task as a classification problem [kunilovskaya2020russe]. Some participants compare the approach based on XLM-R model [conneau-etal-2020-unsupervised] with the word2vec “hypernyms of co-hyponyms” method [arefyev2020russe]. It considers nearest neighbours as co-hyponyms and takes their hypernyms as candidate synsets.
Summing up, the usage of distributed word vector representations is a simple yet efficient approach to the taxonomy-related tasks and should be considered a strong baseline [camacho-collados-etal-2018-semeval, nikishina2020taxonomy].
3.3 Meta-Embedding Approaches to Word Representation
Vector representations can be learned on various datasets and using various models. It has been shown that combining word embeddings is beneficial for NLP tasks, e.g. dependency parsing [bansal2014tailoring], and in medical domain [chowdhury2019mixed].
Coates et al. [coates2018]
show that simple vector combining approaches, such as concatenation or averaging, can significantly improve the overall performance for several tasks. For instance, singular value decomposition (SVD) demonstrates good results with the ability to control the final dimension of vectors[yin2016learning]. Autoencoders [bollegala2018learning] further promote the idea of creating meta-embeddings. The authors propose several algorithms for combining various word vectors into one vector by encoding initial vectors into some meta-embedding space and then decoding backwards.
As for the CAEME approach, all word vectors are encoded into meta-vectors and then concatenated. Then, the decoding step uses a concatenated representation to predict the original vector representations. The AAEME approach is similar to CAEME, except that each vector is mapped to a fixed-size vector and all encoded representations are averaged, but not concatenated. An obvious advantage of this approach is the ability to control the dimension of meta-embeddings.
For any AEME approach, different loss functions can be used at the decoding stage: MSE loss, KL-divergence loss, cosine distance loss and also their combinations. In[neill2018meta] the authors investigated the performance of the autoencoders depending on the loss function. They discover that there is no evident winner across tasks and that different loss functions are defined be different tasks.
Meta-embeddings are already used in such machine learning tasks as dependency parsing[bansal2014tailoring], classification in healthcare [chowdhury2019mixed]winata2019learning, neill2018meta]neill2018meta], word similarity and analogy tasks [coates2018, bollegala2018learning, yin2016learning]. To the best of our knowledge, meta-embeddings have not been applied to the taxonomy enrichment task, especially for the fusion of texts and graph embeddings.
3.4 Graph-based Representations for Taxonomies
Taxonomies can be represented as graphs and there exist various approaches to learn graph-based representations. They are thoroughly compared in [makarov2021fusion] on the link prediction task, which is closely related to Taxonomy Enrichment. The paper also demonstrates that the combination of text and graph embeddings gives a boost on the link prediction task. Most of those methods listed in [makarov2021fusion] have also been tested on tasks related to the taxonomy enrichment.
For instance, node2vec embeddings [node2vec-kdd2016] are used for taxonomy induction among other network embeddings [liu2018interpretation]. In [aly-etal-2019-every], the authors perform the same task. They use hyperbolic Poincaré embeddings to enhance automatically created taxonomies. The SemEval-2016 subtask of reattaching query words to the taxonomy is quite similar to taxonomy enrichment which we perform. However, the datasets of the SemEval-2016 Task 13 are restricted to specific domains, which leaves an open question of the efficiency of Poincaré embeddings for the general domain and larger datasets. Moreover, [aly-etal-2019-every] use Hearst Patterns to discover hyponym-hypernym relationships. This technique operates on words, and cannot be transferred to word-synset relations without extra manipulation.
As for the Knowledge Graph Construction task, which is a more general task in relation to the Taxonomy Induction, the vast majority of approaches also use word embeddings as node representations. Several approaches like [luan-etal-2018-multi] and apply ELMO embeddings [peters-etal-2018-deep] to predict entities and their relations for the knowledge graph. Other approaches [han2020wikicssh] utilize a combination of ELMO, Poincaré and node2vec embeddings to enhance knowledge graph build upon Wikipedia.
Graph convolutional networks (GCNs) [kipf2016semi] as well as graph autoencoders [kipf2016variational] are mostly applied to the link prediction task on large knowledge bases. For example, in [rossi2020knowledge] the authors present an expanded review of the field and compare a wide variety of existing approaches. Graph embeddings are also often used for other taxonomy-related tasks, e.g. entity linking [pujary2020disease]. As for the taxonomy enrichment task, we are only aware of a recent approach TaxoExpan [shen2020taxoexpan] which applies position-enhanced graph neural networks (GCN [kipf2017semi] and GAT [velickovic2018graph]) that we also evaluate on our datasets333The results achieved on our datasets are significantly lower than the baseline, probably because of the incorrect model launching..
Thus, to the best of our knowledge, our work is the first computational study of Taxonomy enrichment task which aggregates and considers different existing and new approaches for taxonomy enrichment. We compare graph- and word-based representations computed from the synsets and hypo-hypernym relations for hypernym prediction demonstrating state-of-the-art results.
4 Diachronic WordNet Datasets
The important part of our study is the observation that one can learn from the history of the development of lexical resources though time. More specifically, we make use of the various historic snapshots (versions) of WordNet lexical graphs and setup a task of their automatic completion assuming the manual update of the ground truth. This diachronic analysis – similar to diachronic lexical analysis of word meanings – is used to build two datasets in our study: one for English, another one for Russian based respectively on Princeton WordNet [miller1995wordnet] and RuWordNet taxonomies. It is important to mention that by using the word “diachronic” we do not imply lexical diachrony, e.g., semantic shifts of words over time [schlechtweg-etal-2020-semeval], but the temporal extension of Wordnet stored in its versions. Each dataset consists of a taxonomy and a set of novel words to be added to this resource. The statistics are provided in Table 1.
|WordNet1.6 - WordNet3.0||17 043||755|
|WordNet1.7 - WordNet3.0||6 161||362|
|WordNet2.0 - WordNet3.0||2 620||193|
|RuWordNet1.0 - RuWordNet2.0||14 660||2 154|
4.1 English Dataset
To compile dataset, we choose two versions of WordNet and then select words which appear only in a newer version. For each word, we get its hypernyms from the newer WordNet version and consider them as gold standard hypernyms. We add words to the dataset if only their hypernyms appear in both versions. We do not consider adjectives and adverbs, because they often introduce abstract concepts and are difficult to interpret by context. Besides, the taxonomies for adjectives and adverbs are worse connected than those for nouns and verbs making the task more difficult.
In order to find the most suitable pairs of releases, we compute WordNet statistics (see Table 2). New words demonstrate the difference between the current and the previous WordNet version. For example, it shows that the dataset generated by “subtraction” of WordNet 2.1 from WordNet 3.0 would be too small, they differ by nouns and verbs. Therefore, we create several datasets by skipping one or more WordNet versions. The statistics for the datasets we select for our study are provided in Table 1.
|WordNet 1.6||66 025||12 127||94 474||10 319||-||-|
|WordNet 1.7||75 804||13 214||109 195||11 088||11 551||401|
|WordNet 2.0||79 689||13 508||114 648||11 306||4 036||182|
|WordNet 2.1||81 426||13 650||117 097||11 488||2 023||158|
|WordNet 3.0||82 115||13 767||117 798||11 529||678||33|
|Word||Translation||RuWordNet hypernyms||RuWordNet hypernym names||WordNet hypernyms|
As gold standard hypernyms, we use not only the immediate hypernyms of each lemma (initial form of a word — infinitive for a verb, single number and nominative case for a noun, etc.) but also the second-order hypernyms: hypernyms of the hypernyms. We include them in order to make the evaluation less restricted. According to our empirical observations, the task of automatically identifying the exact hypernym might be too challenging, and finding the “region” where a word belongs (“parents” and “grandparents”) can already be considered a success.
This method of dataset construction does not use any language-specific or database-specific features, so it could be transferred to other wordnets or taxonomies with timestamped releases.
All datasets444https://zenodo.org/record/4279821 created for this research and the code555https://github.com/skoltech-nlp/diachronic-wordnets for their construction are publicly available.
4.2 Russian Datasets
In order to create an analogous version to English dataset for Russian, we use the RuWordNet taxonomy [loukachevitch2016creating]. RuWordNet comprises synsets — sets of synonyms expressing a particular concept. A synset consists of one or more senses — words or multi-word constructions in the initial form. Therefore, we use the current version of RuWordNet and the extended version of RuWordNet which has not been published yet to compile the dataset (cf. Table 1).
The RUSSE’2020 dataset was created for the Dialogue Evaluation [nikishina2020taxonomy] and can be viewed as a restricted subset of the Russian dataset. In the RUSSE’2020 the following categories of words were excluded:
all three-symbol words and the majority of four-symbol words;
diminutive word forms and feminine gender-specific job titles;
words which are derived from words which are included in the published RuWordNet;
words denoting inhabitants of cities and countries;
geographic and personal names;
compound words that contain their hypernym as a substring.
4.3 WordNet ILI mapping (ru-en)
In order to connect wordnets in different languages the Inter-Lingual Index (ILI) is used [bond-etal-2016-cili]. This mapping is designed to make possible coordination between wordnet projects. For the Russian test sets we also provide mapping from RuWordNet to WordNet666https://doi.org/10.5281/zenodo.4969267. For each hypernym synset of each query word we present the corresponding WordNet synset index. Table 3 demonstrates several examples of this kind of mapping. As we can see, datasets for the Russian nouns and verbs are extended with additional column called “WordNet synsets”, where the corresponding WordNet3.0 synsets are listed in accordance with the Russian synset list.
However, not all synsets have an equivalent in the other language, as there exist untranslatable concepts and lacunae. Therefore, we present in the Table 4 the coverage of the WordNet synsets for the hypernyms of query words from the test set. This mapping can be further used for multilingual experiments.
|Input type||Total||Have ILI mapping|
|Restricted (private) nouns|
|restricted (private) verbs|
4.4 Evaluation Metric
The goal of diachronic taxonomy enrichment is to build a newer version of a wordnet by attaching the new given terms to the older wordnet version. We cast this task as a soft ranking problem and use Mean Average Precision (MAP) score for the quality assessment:
where and are the number of predicted and ground truth values, respectively, is the fraction of ground truth values in the predictions from 1 to , is the label of the -th answer in the ranked list of predictions, and is the indicator function.
This metric is widely acknowledged in the Hypernym Discovery shared tasks, where systems are also evaluated over the top candidate hypernyms [camacho-collados-etal-2018-semeval]. The MAP score takes into account the whole range of possible hypernyms and their rank in the candidate list.
However, the design of our dataset disagrees with MAP metric. As we described in Section 4, the gold-standard hypernym list is extended with second-order hypernyms (parents of parents). This extension can distort MAP. If we consider all gold standard answers as compulsory for the maximum score, it means that we demand models to find both direct and second-order hypernyms. This disagrees with the original motivation of including second-order hypernyms to the gold standard — it was intended to make the task easier by allowing a model to guess a direct or a second-order hypernym.
On the other hand, if we decide that guessing any synset from the gold standard yields the maximum MAP score, we will not be able to provide an adequate evaluation for words with multiple direct hypernyms. There exist two cases thereof:
the target word has two or more hypernyms which are co-hyponyms or one is a hypernym of the other — this word has a single sense, but the annotator decided that multiple related hypernyms are needed to reflect all shades of the meaning,
the target word has two or more hypernyms which are not directly connected in the taxonomy and neither are their hypernyms. This happens if:
the word’s sense is a composition of senses of its hypernyms, e.g. “impeccability” possesses two components of meaning: (“correctness”, “propriety”) and (“morality”, “righteousness”);
the word is polysemous and different hypernyms reflect different senses, e.g. “pop-up” is a book with three-dimensional pages (“book, publication”) and a baseball term (“fly, hit”).
While the case 2a corresponds to a monosemous word and the case 2b indicates polysemy, this difference does not affect the evaluation process. We propose that in both these cases in order to get the maximum MAP score a model should capture all the unrelated hypernyms which correspond to different components of sense. At the same time, we should bear in mind that guessing a direct hypernym or a second-order hypernym are equally good options. Therefore, following [nikishina2020taxonomy], we evaluate our models with modified MAP. It transforms a list of gold standard hypernyms into a list of connected components. Each of these components includes hypernyms (both direct and second-order) which form a connected component in a taxonomy graph. (According to graph theory, connected component is a subgraph, in which there is a path between any two nodes.) Thus, in the case 1 we will have a single connected component, and a model should guess any hypernym from it to get the maximum MAP score. In the cases 2a and 2b we will have multiple components, and a model should guess any hypernym from each of the components.
5 Base Methods
Here we first describe our baseline model which is a method of synset ranking based on distributional embeddings and hand-crafted features (the method was proposed as a baseline for RUSSE-2020 shared task [nikishina-etal-2020-studying]
). We then propose extending it with new features extracted from Wiktionary and use the alternative sources of information about words (e.g. graph representations) and their combinations.
We consider the approach by Nikishina et al. [nikishina-etal-2020-studying] as our baseline. There, we first create a vector representation for each synset in the taxonomy by averaging vectors (pretrained embeddings) of all words from this synset. Then, we retrieve top 10 synsets whose vectors are the closest to that of the query word (we refer to these synsets as synset associates). For each of these associates, we extract their immediate hypernyms and hypernyms of all hypernyms (second-order hypernyms). This list of the first- and second-order hypernyms forms our candidate set. We need to rank the candidates by their relevance for the query word. Note that the lists of candidates for different associates can have intersections. When forming the overall candidate set, we make sure that each candidate occurs in it only once.
The intuition behind the method is the following. We propose that if a synset of a taxonomy is a parent of a word which is similar to our query word, it can also be a parent of this query word.
To rank the candidate set of synsets we train a Linear Regression model with L2-regularisation on the training dataset formed of the words and synsets of WordNet. Candidate hypernyms are ranked by their model output score. We limit the output to thebest candidates.
We rank the candidate set using the following features:
, where is a vector representation of a word or a synset , is a hypernym, is the number of occurrences of this hypernym in the merged list, is the cosine similarity of the vector of the input word and hypernym vector ;
the candidate presence in the Wiktionary hypernyms list for the input word (binary feature);
the candidate presence in the Wiktionary synonyms list (binary feature);
the candidate presence in the Wiktionary definition (binary feature);
the average cosine similarity between the candidate and the Wiktionary hypernyms of the input word.
We present a new method of taxonomy enrichment — Distributional Wiktionary-based synset Ranking (DWRank). It combines distributional features with features from Wiktionary. DWRank builds up on the baseline described in Section 5.1. We extend the baseline Logistic Regression model with the new features which mainly account for the number of occurrences of a synset in the candidate lists of different synset associates (nearest neighbours) of the query word. We introduce the following new features:
the number of occurrences (n) of the synset in the merged candidate list and the quantity which serves for smoothing,
the minimum, average, and maximum proximity level of the synset in the merged candidate list:
the level is 0 if the synset was added based on similarity to the query word,
the level of 1 is for the immedidate hypernyms of the query word,
the level of 2 is for the hypernyms of the hypernyms,
the minimum, average, and maximum similarities of the query word to all words of the synset,
the features based on hyponyms of a candidate synset (“children-of-parents”):
we extract all hyponyms (“children”) of the candidate synset,
for each word/phrase in each hyponym synset we compute their similarity to the query word,
we compute the minimum, average, and maximum similarity for each hyponym synset,
we form three vectors: a vector of minimums of similarities, average similarities, and maximum similarities of hyponym synsets,
for each of these vectors we compute minimum, average, and maximum. We use these resulting 9 numbers as features.
These features account for different aspects of similarity of the candidate’s children to the query word and help defining if these children can be the query word’s co-hyponyms (“siblings”).
Moreover, in this approach we use cross-validation and feature scaling when training the Logistic Regression model.
This methods could be easily extended to other languages that possess a taxonomy, a wiki-based open content dictionary (Wiktionary) and text embeddings like fastText or/and word2vec and GloVe.
5.3 Web-based Synset Ranking (WBSR)
In this section, we describe WBSR (Web-based Synset Ranking) a method which leverages the power of the existing general-purpose services. It makes use of the two famous search engines: Google (for both English and Russian datasets) and Yandex777https://yandex.com (for Russian only). According to our hypothesis, the search results for a word are likely to contain its hypernyms or co-hyponyms as they are often used to define a word via generalisation or by providing synonyms (co-hyponyms). For instance, if we do not know what “abdominoplasty” is, searching for it with a search engine can yield its definition “a cosmetic surgery procedure”.
Another source that we could probably benefit from is another taxonomy, preferably larger than the one we work with. However, there might be no other taxonomies available in the same language. Therefore, in this case we can resort to Machine Translation and automatically translate query words into a rich-resource language (e.g. English) in order to use an existing taxonomy (e.g. English Princeton WordNet). In this study we use Yandex Machine Translation system888https://translate.yandex.ru/ to translate query words into English and then translate hypernyms (if they are found) back into Russian.
The main drawback of using external sources such as search engines and machine translation systems is their weak reproducibility. The search results are dependent on the search history, so reproducing the experiment on a different account or after a relatively long period of time is problematic. However, since the method greatly improves the performance even with trivial handling of the collected data, we use it despite its drawback. To make our results reproducible, we release all data from the external sources used in our approach.999https://doi.org/10.5281/zenodo.4540717
Similarly to the approaches described above, here we also make use of Wiktionary and fastText embeddings cosine similarity. However, we treat synsets and words/phrases that they consist of in a different way. In the previously described approaches we computed embeddings for multiword phrases by averaging word embeddings of individual words in them. Here we treat them as sentences — we compute their embeddings using the get_sentence_vector method from fastText Python library. There, fastText vectors are divided by their norms and then averaged, so that only vectors with the positive -norm value are considered. Secondly, we do not combine the word/phrase vectors into a synset vector but operate with the word/phrase embeddings directly.
Similarly to DWRank, the algorithm consists of two steps: candidate generation and candidate ranking. Our candidate list is formed of the following synsets:
synsets which contain words/phrases from the list of top-10 nearest neighbours of the query word;
hypernyms and second-order hypernyms of those synsets;
synsets that contain words/phrases listed in Wiktionary as the hypernyms of the query word;
hypernym synsets of these synsets;
cross-lingual candidates (for the Russian language only):
synsets that contain words/phrases listed in the English WordNet as the hypernyms of the query word;
hypernym synsets of these synsets.
Analogously to DWRank, we then rank all candidates by a logistic regression model which uses the following features:
the candidate synset contains a word/phrase from the list of query word’s nearest neighbours;
the candidate synset is a hypernym of one of the nearest neighbours;
the candidate is a second-order hypernym of one of the nearest neighbours;
the candidate synset contains a hypernym of the query word from Wiktionary;
the candidate synset is a hypernym of the synset which contains a hypernym of the query word from Wiktionary;
the candidate synset contains a word which is present in the definition of the query word from Wiktionary;
the candidate synset is present in the list of English WordNet synset candidates (for the Russian language only);
the candidate synset is present in the list of hypernyms of the WordNet candidates (for the Russian language only);
the candidate synset contains words which occur on the Google results page;
the candidate synset contains words which occur on the Yandex results page (for the Russian language only).
The training set for the logistic regression model is formed from a wordnet in the relevant language as follows. For each query word in the list of query words we first find the most similar lemma which is contained in the wordnet. We know hypernyms for these lemmas and use them to generate the training set. We generate the candidate list as described before. First- and second-order hypernyms in this candidate list are used as positive examples for the corresponding lemmas, and synsets from the candidate list which are not hypernyms are considered negative examples.
This approach participated in the RUSSE’2020 Taxonomy Enrichment task for the Russian Language. The method achieved the best result on the nouns track. Therefore, we consider it as the-state-of-the-art method for Russian.
5.4 WordNet Path Prediction
A completely different approach to make use of fastText embeddings is presented in the work of Cho et al. [cho-etal-2020-leveraging]. The authors experiment with encoder-decoder models in order to solve the task of the direct hypernym prediction. They use a standard LSTM-based sequence-to-sequence model [sutskever2014sequence] with Luong attention [luong-etal-2015-effective]. First, they average fastText embeddings for the input word or phrase and put it through the encoder. The decoder sequentially generates a chain of synsets from the encoder hidden state, conditioned on the previously generated ones. The authors consider two different setups:
hypo2path — given the input word, generate a sequence of synsets starting from the root synset and going down the taxonomy to the closest hypernym;
hypo2path reverse — given the input word, generate a sequence of synsets starting from the closest hypernym up to the root entity.
To be able to apply this sequence-to-sequence architecture to our data, we build new datasets similar to the ones described in [cho-etal-2020-leveraging]. We generate a path from the WordNet starting from the root node to the target synset or word. Analogously to the original work, we include multiple paths from the root to the parents of the query word. We filter the validation set to only include queries that do not occur anywhere in the full taxonomy paths of the training data. To sort candidates generated by the decoder, we enumerate the generated hypo2path sequence from the right to the left or the hypo2path reverse from the left to the right and get the first 10 synsets.
Additionally, we extend this approach by replacing the LSTM+attention architecture with the Transformer architecture [NIPS2017_3f5ee243]. During training we provide an embedding of a synset as input to the Transformer and expect the model to generate a sequence of synsets starting from the hypernym of the input synset. During inference we provide embedding of query words as input expect the model to output sequences of synsets starting with the direct hypernyms.
5.5 Word Representations for DWRank
We test our baseline approach and DWRank with different types of embeddings: fastText [bojanowski-etal-2017-enriching], word2vec [word2vec] embeddings for English and Russian datasets and also GloVe embeddings [pennington-etal-2014-glove] for the English dataset.
We use the fastText embeddings from the official website101010https://fasttext.cc/docs/en/crawl-vectors.html
for both English and Russian, trained on Common Crawl from 2019 and Wikipedia CC including lexicon from the previous periods as well. For word2vec we use models from[fares-etal-2017-word, KutuzovKuzmenko2017] for both English111111http://vectors.nlpl.eu/repository/20/29.zip and Russian.121212http://vectors.nlpl.eu/repository/20/185.zip We lemmatise words and synsets for both languages with the same UDPipe [udpipe:2017] model which was used while training the representations. For the out-of-vocabulary (OOV) words we find all words in the vocabulary with the longest prefix matching this word and average their embeddings like in [dale2020russe]. As for the GloVe embeddings, we also use them from the official website131313https://nlp.stanford.edu/projects/glove/ trained on Common Crawl, the vocabulary size is 840 billion tokens.
The DWRank method extracts a set of candidate synsets based on the similarities of word vectors. So far we used only distributional word vectors (fastText, GloVe, etc.) to represent words. On the other hand, graph-based representations can contain the taxonomic information which is absent in distributional embeddings [makarov2021fusion].
Here, we present DWRank-Graph. This is the same DWRank method where the distributional embeddings are replaced with graph representations. The score prediction model and the features it uses do not change. Below we describe the graph representations and their combinations we applied in DWRank-graph.
6.1 Poincaré Embeddings
Poincaré embeddings is an approach for “learning hierarchical representations of symbolic data by embedding them into hyperbolic space — or more precisely into an ”-dimensional Poincaré ball” [nickel2017poincare]. Poincaré models are trained on hierarchical structures and simultaneously capture hierarchy and similarity due to the underlying hyperbolic geometry. According to the authors, hyperbolic embeddings are more efficient on the hierarchically structured data and may outperform Euclidean embeddings in several tasks, e.g, in Taxonomy Induction [aly-etal-2019-every].
Therefore, we use Poincaré embeddings of our wordnets for the taxonomy enrichment task. We train Poincaré ball model for our wordnets using the default parameters and the dimensionality of 10, which yields the best results on the link prediction task [nickel2017poincare].
However, applying these embeddings to the task is not straightforward, because Poincaré model’s vocabulary is non-extensible. It means that new words that we need to attach to the existing taxonomy will not have any Poincaré embeddings at all and we cannot make use of the embeddings similarity. To overcome this limitation, we compute top-5 fastText nearest synsets (analogously to the procedure described in Section 5.1) and then aggregate embeddings in hyperbolic space using Einstein midpoint, following [guh2016hyperbolic]. The resulting vector is considered as an embedding of the input word in the Poincaré space.
Then, we use vectors from this vector space to generate candidates for the DWRank approach. As the model we present in Section 5.2 does not depend on the types of input embeddings, we are able to provide Poincaré embeddings as input.
6.2 Node2vec Embeddings
The hierarchical structure of the taxonomy is a graph structure, and we may also consider taxonomies as graphs and apply random walk approaches to compute embeddings for the synsets. For this purpose we apply node2vec [node2vec-kdd2016] approach which represents a “random walk of a fixed length ” with “two parameters and which guide the walk in breadth or in depth”. Node2vec randomly samples sequences of nodes and then applies the skip-gram model [DBLP:journals/corr/abs-1301-3781] to train their vector representations. We train node2vec representations of all synsets in our wordnets with the following parameters: , , . The other parameters are taken from the original implementation.
However, analogously to Poincaré vector space, node2vec model has no techniques for representing out-of-vocabulary words. Thus, it is unable to map new words to the vector space. To overcome this limitation, we apply the same technique of averaging top-5 nearest neighbours from fastText and considering their mean vector as the new word embedding and search for the most similar synsets.
6.3 Graph Neural Networks
The models described above have a major shortcoming: the resulting vectors for the input words heavily depend on their representations in the fastText model. This can lead to the incorrect results if the word’s nearest neighbour list is noisy and does not reflect its meaning. In this case the noise will propagate through the graph embedding (Poincaré or node2vec) model and result in inaccurate output even if the graph embedding model is of high quality.
Therefore, we test different graph neural network (GNN) architectures — Graph Convolutional Network [kipf2016semi], Graph Attention Network [velickovic2018graph] and GraphSAGE [hamilton2017inductive] (SAmple and aggreGatE) which make use of both fastText embeddings and the graph structure of the taxonomy.
All the above mentioned models work similarly. According to [makarov2021fusion], “GCN works similarly to the fully-connected layers for neural networks. It multiplies weight matrices with the original features but masking them with an adjacency matrix. Such a method allows to account not only node representation, but also representations of neighbours and two-hop neighbours.” The GraphSAGE model addresses the problem of unseen nodes representation by training a set of aggregator functions that learn to aggregate feature information from a node’s local neighborhood. GCN, on the contrary, learns a distinct embedding vector for each node. In GAT, the convolution operation from GCN is replaced with the attention mechanism. It uses the self-attention mechanism of Transformers [transformer] to aggregate the information from the one-hop neighbourhood.
FastText embeddings are used as input node features for all models, which is definitely an advantage of the model over Poincaré and node2vec, as they do not use word embeddings for training. Even though new words are not connected to the taxonomy, it is still possible to compute their embeddings according to their input node features.
We get the vector representations of query words from one of the pre-trained GNN models and then use them as the input to DWRank. Even though all methods work similarly, they demonstrate different performance on different datasets.
6.4 Text-Associated Deep Walk
Text-Associated Deep Walk (TADW) [yang2015network] is another approach that incorporates text and graph information into one vector representation. The method is based on the DeepWalk algorithm [perozzi2014deepwalk] which learns feature representations by simulating uniform random walks. To be specific, the sampling strategy in DeepWalk can be seen as a special case of node2vec with and .
The authors prove that the DeepWalk approach is equivalent to matrix factorization. They incorporate text features of vertices into network representation learning within the framework of matrix factorization. First, they define matrix where each entry is the logarithm of the average probability that vertex randomly walks to vertex in a fixed number of steps. In comparison to DeepWalk, where the goal is to factorize matrix into the product of two low-dimensional matrices and (), TADW aims to factorize matrix into the product of three matrices: , and text features . As text features, in this work we apply fastText embeddings.
After having learnt the factorisation of matrix , we use rows of matrix as node (synset) embeddings in DWRank.
6.5 High-Order Proximity preserved Embeddings
High-Order Proximity preserved Embeddings (HOPE) [ou2016asymmetric] is yet another approach that embeds a graph into a vector space preserving information about graph properties and structure. Unfortunately, most structures cannot preserve the asymmetric transitivity, which is a critical property of directed graphs. To solve the problem, the authors employ matrix factorization to directly reconstruct asymmetric distance measures like Katz index, Adamic-Adar or common neighbors. This approach is scalable — it preserves high-order proximities of large scale graphs, and capable of capturing the asymmetric transitivity. HOPE outperforms state-of-the-art algorithms in tasks of reconstruction, link prediction and vertex recommendation.
As for our Taxonomy Enrichment task, we also apply HOPE to generate graph embeddings to be used as input in the DWRank model. The main difference with the other embeddings is that HOPE does not incorporate textual information from the nodes.
In DWRank we employed only distributional information, i.e. pre-trained word embeddings, whereas in DWRank-Graph we represented words using the information from the graph structure of the taxonomy and usually ignoring their distributional properties. Meanwhile, taxonomy enrichment models may benefit from combining these two types of information. Therefore, we present DWRank-Meta — an extension of DWRank which combines multiple types of input word representations.
As in DWRank-Graph, the process of candidates selection, the feature set and the algorithm of synset ranking stay intact. Here we change the input representations of words and synsets.
7.1 Base Meta-Embeddings
The easiest way of combining embeddings of different types is to concatenate them and use the concatenated vector as an input. We refer to this method as Concat
and combine different subsets of distributional (fastText, word2vec, GloVe) and graph-based (Poincaré, node2vec, GCN, GAT, GraphSAGE, TADW, HOPE) embeddings. In addition to that, we perform Singular value decomposition (SVD) over this concatenation as proposed in[yin2016learning]. This approach is referred to as SVD.
7.2 Autoencoded Meta-Embeddings
We propose using two variants of autoencoders for the generation of meta-embeddings: Concatenated Autoencoded Meta-Embeddings (CAEME) and Averaged Autoencoded Meta-Embeddings (AAEME) [bollegala2018learning]. They have shown good results on the task of evaluating lexical similarity. However, they have never been applied to taxonomy enrichment.
We generate meta-embeddings as follows. Let us consider an embedding model . For each of such embedding models we train an autoencoder consisting of an encoder and a decoder:
where and are the encoder and the decoder, and is the loss used for training of the autoencoder. The loss is implemented as the distance (dist) between the original and the reconstructed embeddings. The dist can be any distance or similarity measure such as MSE, KL-divergence, or cosine distance. In our preliminary study, the cosine distance showed the best results, so we use it in our experiments.
Let us consider two embedding models and . In such a case, the input to each decoder is not the result of the corresponding encoder, but meta-embeddings, which depends on the both encoders. Depending on the approach, meta-embeddings can be built in different ways, we construct the meta-embeddings as follows. In case of CAEME, we take an -normalised concatenation of the two source embeddings encoded with respective encoders and :
where is the concatenation operation.
The drawback of this model is the growing dimensionality of meta-embeddings for cases where we combine multiple source embeddings. To fight that, we can replace the concatenation operation with averaging, yielding AAEME. It computes meta-embedding of a word from its two source embeddings and as the -normalised sum of internal representations and :
In CAEME, the dimensionality of the meta-embedding space is the sum of the dimensions of the source embeddings, whereas in AAEME it stays the same. The AAEME encoder can be seen as a special case of the CAEME encoder where the meta-embedding is computed by averaging the two encoded sources in equation 3 instead of their concatenation.
7.3 Training of Autoencoders
We can impose additional restrictions on *AEME models during training. One of such restrictions is the use of triplet loss. We restrict a word to be closer to the words that are semantically related to it according to the taxonomy than to a randomly chosen word with some :
where ||.|| is a distance function, is the target word, and are positive and negative words, respectively.
The algorithm of calculating triplet loss is as follows:
for each word presented in the taxonomy, we compile a list of semantically related words which includes synonyms, hyponyms and hypernyms;
at each epoch, we randomly selectpositive words from this related words set and form a set of negative words by selecting them randomly from the vocabulary;
if the word is not presented in the taxonomy, then we cannot form a list of related words for it. In this case, we generate positive vectors for it by adding random noise to its vector;
next, we calculate the triplet margin loss by combining the triplet loss with the original loss as .
We use the following parameters for the triplet loss: , , . These parameters were selected via grid search with AAEME algorithm on the English 1.7 dataset.
In this section, we report and discuss the performance of our models in the taxonomy enrichment task. We experiment with our DWRank approach and its modifications DWRank-Graph and DWRank-Meta. In addition to that, we compare them with the baseline introduced in RUSSE’2020 shared task and with a number of state-of-the-art methods. We conduct the experiments with English and Russian wordnets.
8.1 Experimental Setup
For each result we add the standard deviation values. We calculate them as follows. We randomly sample 80% of the test data and calculate the MAP scores on that part of the test set. We repeat the same procedure 30 times and then calculate the standard deviation on those 30 MAP values.
The MAP metric should be interpreted as follows: the higher the score, the better the results. For instance, means that all of the correct hypernyms are present in the first top- positions in the ranked list of candidates. Moreover, with the first candidate is always correct. means that at least one of the correct hypernyms is present in the two first positions (top-2) in the ranked lists of candidates. means that at least one of the correct hypernyms is present in the first three positions (top-3) in the ranked lists of candidates.
We show the performance of different methods on the nouns attribution task for English (nouns 1.6) in Figure 2 and for Russian (non-restricted nouns) in Figure 3. The X axis shows the MAP score for each method, the methods are listed along the Y axis. For DWRank-Meta models, we list the embeddings used in the model in brackets. The colors of bars in figures correspond to different types of input embeddings. The orange color stands for the vanilla DWRank – DWRank which uses only distributional embeddings. Purple denotes the DWRank-Graph variants. For the DWRank-Meta approaches exploiting only distributional embeddings we use pink, and DWRank-Meta on word and graph embeddings is denoted with the bright green color. Previous SOTA approaches are shown in yellow.
Figures 2 and 3 show that the leaderboard for both English and Russian nouns is dominated by DWRank-Meta models. While English benefits from the union of distributional and graph embeddings, for Russian distributional embeddings alone perform on par with their combinations with graph embeddings. Besides that, high-performing variants of DWRank-Meta for English feature TADW, node2vec, and GraphSAGE, whereas for Russian TADW is the only graph embedding model which does not decrease the scores of DWRank-Meta.
We see that triplet loss significantly improves the results for DWRank-Meta models (cf. AAEME/CAEME with and without triplet loss) for both English and Russian.
On the other hand, DWRank-Graph fails in the task of taxonomy extension for all datasets. TADW model is the only graph embedding model which can compete with DWRank-Meta models. This can be explained by the fact that TADW is an extended version of DeepWalk and applies the skip-gram model with the pre-trained fastText representations. In contrast to that, the other graph models suffer from the noisy representations of OOV query words.
At the same time, despite the success of TADW, it does not outperform models based solely on distributional embeddings, showing that graph representations apparently do not contribute any information which is not already contained in distributional word vectors.
We also notice that for both languages the baselines are quite competitive. They are substantially worse than the best-performing models, but they are much simpler to implement and is fast and easy to train. Therefore, we suggest that it should be preferred in the situation of limited resources and time.
However, the choice of embedding models is crucial for the baselines (as well as for the vanilla DWRank which performs closely). We see that fastText outperforms word2vec and GloVe embeddings for almost all languages and datasets. The low scores of GloVe and word2vec embeddings on baseline and DWRank methods can be explained by data coverage issues. Fixed vocabularies of word2vec and Glove do not allow generating any representation for missing query words, whereas fastText can handle them.
Neither of SOTA models managed to outperform the fastText baseline or approach the best DWRank-Meta variants. Web-based synset ranking (WBSR) model shows that the information from online search engines and Machine Translation models is beneficial for the task – its performance without this information drops dramatically. However, this information is not enough to outperform the word embedding-based models.
The performance of hypo2path model is even lower than that of WBSR. Being an autoregressive generative model, it is very sensitive to its own mistakes. Generating one senseless hypernym can ruin all the following chain. Conversely, when starting with the root hypernym “entity.n.01”, it often takes a wrong path. Finally, TaxoExpan model relies on definitions of words which we did not provide in this task. Therefore, its results are close to zero. We do not consider them credible and provide them in italics.
Performance for different datasets
Figures 2 and 3 as well as the results in the appendix show that there is no single best-performing model. While DWRank-Meta is almost always the best, for different datasets different variations of this model are the most successful. The results are usually consistent for the same language and part of speech (e.g. for different versions of English nouns datasets the best-performing model is the same), but there are exceptions to this regularity.
MAP metric which is the standard way of evaluating taxonomy enrichment models has a serious drawback. Namely, it is not interpretable, which hampers the understanding of the models’ performance.
Therefore, in addition to MAP we report the Precision@k score which can be interpreted as the ratio of correct synsets in the top-k outputs of a model. We evaluate our best systems automatically with the Precision@k () score. The choice of the values of is explained by the fact that the average number of true ancestors is 2 for the English words and 3 for the Russian words. Thus, Precision@k for will be unfairly understated, because there will always be at most 3 correct answers out of k. This means that for the the maximum is 0.75, for it is 0.6, etc.
|English nouns 1.6-3.0|
|DWRank-Meta (Words + Graph)||0.311||0.199||0.154|
|English verbs 1.6-3.0|
|DWRank-Meta (Words + Graph)||0.259||0.164||0.124|
|Russian non-restricted nouns|
|DWRank-Meta (Words + Graph)||0.397||0.255||0.192|
|Russian non-restricted verbs|
|DWRank-Meta (Words + Graph)||0.341||0.231||0.180|
Table 5 shows the Precision@k scores for the best performing English and Russian models on the nouns datasets. Both of them are DWRank-Meta models with AAEME autoencoders. The English model uses three types of distributional embeddings and TADW graph embeddings, while the Russian model uses only fastText and word2vec but benefits from the triplet loss. We see that Precision@k is particularly high for Russian. There, over a half of generated lists contain a correct synset in the first position. This shows that DWRank-Meta can successfully be used as a helper tool for taxonomy extension.
The results for English are lower. However, this should not be considered as a sign of lower performance of models for English. The Russian and English datasets consist of different words, so they cannot be directly compared.
9 Error Analysis
To better understand the difference in systems performance and their main difficulties, we made a quantitative and qualitative analysis of the results.
9.1 Comparison of Graph-based Approaches with Word Vector Baselines
First of all, we wanted to know to what extent the set of correct answers of graph-based models overlaps with the one of fastText-based models. In other words, we would like to know if the graph representations are able to discover hypernymy relations which could not be identified by word embeddings.
Therefore, for each new word we computed average precision () score and compared those scores across different approaches. We found that at least 90% words for which fastText failed to identify correct hypernyms (i.e. words with ) also have the of 0 in all the graph-based models. This means that if fastText cannot provide correct hypernyms for a word, other models cannot help either. Moreover, all words which are correctly predicted by graph-based approaches, are also correctly predicted by fastText. Moreover, only 8% to 55% words correctly predicted by fastText are also correctly predicted by any of the graph-based models. At the same time, the number of cases where graph-based models perform better than fastText is very low (3–5% cases). Thus, combining them cannot improve the performance significantly. This observation is corroborated by the scores of the combined models.
To contrast the performance of the text and graph embeddings and to demonstrate the input and the output formats of the models we present Tables 7 and 9 in Appendix A. They demonstrate the main features of the tested approaches. The examples do not pretend do be the general case example, however, they do provide the idea about ranking of the results and the performance of text, graph and fusion embedding types.
9.2 Performance on Polysemous Words
The differences in word semantics make the dataset uneven. In addition to that, we would also like to understand whether the performance of models depends on the number of connected components (possible meanings) for each word. Thus, we examine how many words with more than one meaning can be predicted by the system.
Figure 4 depicts the distribution of synsets over the number of senses they convey. As we can see, the vast majority of words are monosemous. For Russian nouns, the system correctly identifies almost half of them, whereas for other datasets the share of correctly predicted monosemous words is below 30%. This stems from the fact that for distributional models it is difficult to capture multiple senses in one vector. They usually capture the most widespread sense of a word. Therefore, the number of predicted synsets with two or more senses is extremely low. A similar power law distribution would be obtained using BERT embeddings, as we are still averaging embeddings from all contexts. This may be one of the reasons why contextualised models did not perform better than the fastText models which capture the main meaning only but do it well.
9.3 Error Types
In order to understand why a large number of word hypernyms (at least 60%) are too difficult for models to predict, we turn to manual analysis of the system outputs. We find out that errors can be divided into two groups: system errors caused by distributional models limitations and taxonomy inaccuracies. Therefore, we come across five main error types:
Type 1. Extracted nearest neighbours can be semantically related words but not necessary co-hyponyms:
delist (WordNet); expected senses: get rid of; predicted senses: remove, delete;
хэштег (hashtag, RuWordNet); expected senses: отличительный знак, пометка (tag, label); predicted senses: символ, короткий текст (symbol, short text).
Type 2. Distributional models are unable to predict multiple senses for one word:
latakia (WordNet); expected senses: tobacco; municipality city; port, geographical point; predicted senses: tobacco;
запорожец (zaporozhets, RuWordNet); expected senses: житель города (citizen, resident); марка автомобиля, автомобиль (car brand, car); predicted senses: автомобиль, мототранспортное средство, марка автомобиля (car, motor car, car brand).
Type 3. System predicts too broad / too narrow concepts:
midweek (WordNet); expected senses: day of the week, weekday; predicted senses: time period, week, day, season;
медянка (smooth snake, RuWordNet); expected senses: неядовитая змея, уж (non-venomous snake, grass snake); predicted senses: змея, рептилия, животное (snake, reptile, animal).
Type 4. Incorrect word vector representation: nearest neighbours are semantically far from the meaning of the inputt word:
falanga (WordNet); expected senses: persecution, torture; predicted senses: fish, bean, tree, wood.;
кубокилометр (cubic kilometer, RuWordNet); expected senses: единица объема, единица измерения (unit of capacity, unit of measurement); predicted senses: город, городское поселение, кубковое соревнование, спортивное соревнование (city, settlement, competition, sports contest).
Type 5. Unaccounted senses in the gold standard datasets, inaccuracies in the manual annotation:
emeritus (WordNet); expected senses: retiree, non-worker; predicted senses: professor, academician;
сепия (sepia, RuWordNet); expected senses: морской моллюск “sea mollusc”; predicted senses: цвет, краситель (color, dye).
In order to check how useful the predicted synsets are for a human annotator (i.e. if a short list of possible hypernyms can speed up the manual extension of a taxonomy), we conduct the manual evaluation of 10 random nouns and 10 random verbs for both languages (the words are listed in Table 6). We focus on worse-quality cases and thus select words whose MAP score is below 1. Annotators with the expertise in the field and the knowledge of English and Russian were provided with guidelines and asked to evaluate the outputs from our best-performing system. Each word was labelled by 4 expert annotators, Fleiss’s kappa is (substantial agreement) for both datasets.
We compute Precision@k score (the share of correct answers in the generated lists from position 1 to ) for from 1 to 10, shown in Figure 5. We can see that even for words with MAP below 1 our model manages to extract useful hypernyms.
In this work, we performed a large-scale computational study of various methods for taxonomy enrichment. We also presented datasets for studying diachronic evolution of wordnets for English and Russian, extending the monolingual setup of the RUSSE’2020 shared task [nikishina2020taxonomy] with a larger Russian dataset and similar English versions.
We presented a new taxonomy enrichment method called DWRank, which combines distributional information and the information extracted from Wiktionary outperforming the baseline method from [nikishina-etal-2020-studying] on the English datasets. We also presented its extensions: DWRank-Graph and DWRank-Meta which use graph and meta- embeddings via a common interface.
We also explored the benefits of meta-embeddings (combinations of embeddings) and graph embeddings for the task of taxonomy enrichment. On the Russian datasets DWRank-Meta performed best using fastText and word2vec word embeddings. For the English dataset the combination of word (fastText, word2vec and GloVe) and graph (TADW) embeddings demonstrated the best performance.
In this paper we also presented WBSR — a method for taxonomy extension which leverages the information from the Web. This approach was the current state of the art in the task for the Russian language. Now it lags significantly behind the new DWRank-meta approaches.
According to our experiments, word vector representations are simple, powerful, and extremely effective instrument for taxonomy enrichment, as the contexts (in a broad sense) extracted from the pre-trained word embeddings (fastText, word2vec, GloVe) and their combination are sufficient to attach new words to the taxonomy. TADW embeddings are also useful and efficient for the taxonomy enrichment task and in combination with the fastText, word2vec and GloVe approaches demonstrate SOTA results for the English language and compatible results for the Russian language.
Error analysis also reveals that the correct synsets identified by graph-based models are usually retrieved by the fastText-based model alone. This makes graphs representations mostly irrelevant and excessive. Nonetheless, there exist cases where graph representations were able to identify correctly some hypernyms which were not captured by fastText.
Despite the mixed results of the application of graph-based methods, we propose further exploration of the graph-based features as the existing resource contains principally different and complementary information to the distributional signal contained in text corpora. One way to improve their performance, may be to use more sophisticated non-linear projection transformations from word to graph embeddings. Another promising way in our opinion is to explore other types of meta-embeddings to mix word and graph signals, e.g. GraphGlove [ryabinin2020embedding]. Moreover, we find it promising to experiment with temporal embeddings such of those of [GoelKazemiBrubakerPoupart2020] for the taxonomy enrichment task.
Last but not least, we plan to explore methods that do not rely on the set of pre-defined candidates for inclusion in a taxonomy, as generation and mining of such a set may be a challenging problem in its own.
The participation of Mikhail Tikhomirov in the reported study (experiments with meta-embeddings) was funded by RFBR, project number 19-37-90119. The work of Natalia Loukachevitch (preparation of RuWordNet data for the experiments, mappings of the RuWordNet synsets to English WordNet Linked Open Data (LOD) Inter-Lingual Index (ILI)) is supported by the Russian Science Foundation (project 20-11-20166). The work of Alexander Panchenko, Varvara Logacheva, and Irina Nikishina was conducted in the framework of joint MTS-Skoltech laboratory.
Appendix A Ranked Examples and All Results
In this Appendix we provide Tables 7, 9 and 9 with the examples that demonstrate the input and the output formats of the models as well as Tables 12 and 13 that show the performance of all models for both English and Russian datasets.
From both tables 7 and 9 (underlined bold text denotes predictions of the model from the ground truth) we can see that at least one of the correct candidates usually appears in the list of candidates from word embeddings (first part of the table), whereas among candidates from graph embeddings we do not see any decent synsets. Poincaré embeddings retrieved by aggregating words from fastText provide too broad concepts which are clearly too far from the correct answers (“activity.n.01”, “exposure.n.03”, “action.n.02”). Node2vec embeddings are both semantically far and abstract. GraphSAGE is sticking to the word “play” and are too far from the correct answers in general. TADW manages to predict the correct synset “therapy.n.01” in the list of candidates, however, its position is much lower than the positions of the same synsets among the candidates provided by word embeddings-based systems.
The candidates for the words “play therapy” and “eyewitness” provided by models based on text embeddings do contain at least one true answer in the list. The position of such words varies from the forth to the sixth position.
DWRank-Meta on both word and graph embeddings may provide the results that improve the ranking, e.g. “therapy.n.01” is the correct candidate on the first place in the list for both AAEME triple loss (fastText, word2vec and GloVe) approach and for the AAEME (fastText, word2vec, Glove, TASDW) approach. For the word “eyewitness” the DWRank-Meta models on words and graphs the true candidates are placed in the worse positions, however, they still win before the DWRank-GRaph approach.
|fastText (baseline)||fastText (DWRank)||
|islamic_calendar_month.n.01, calendar_month.n.01, fast.n.01, abstinence.n.02|
|fastText (baseline)||fastText (DWRank)||
|fastText (baseline)||fastText (DWRank)||
|cover.v.05, broach.v.01, chew_over.v.01, think.v.03|
|fastText (baseline)||fastText (DWRank)||
|protect.v.01, defend.v.02, inject.v.01, administer.v.04|
|fastText (baseline)||fastText (DWRank)||
|DWRank-Meta (Meta-embeddings based on Word Embeddings)|
|CAEME triplet loss (words)||0.3320.001||0.3940.003||0.4510.004||0.2730.007||0.2050.007||0.2560.013|
|AAEME triplet loss (words)||0.3350.001||0.3910.003||0.4530.004||0.2800.008||0.2120.007||0.2620.014|
|TADW [yang2015network] (on fastText)||0.3500.001||0.3920.002||0.4350.004||0.2680.007||0.2010.007||0.2170.010|
|Poincaré [nickel2017poincare] (top-5 fastText associates)||0.1850.001||0.2110.002||0.2290.002||0.2080.006||0.1470.006||0.1720.012|
|node2vec [node2vec-kdd2016] (top-5 fastText associates)||0.2700.001||0.3120.002||0.3410.004||0.1750.006||0.1280.007||0.1180.012|
|DWRank-Meta (Meta-embeddings based on Word and Graph Embeddings)|
|SVD (words + node2vec)||0.3430.001||0.3830.003||0.4340.005||0.2720.006||0.1940.009||0.2390.011|
|CAEME (words + node2vec)||0.3350.001||0.3790.003||0.4260.004||0.2420.005||0.1840.009||0.2210.012|
|AAEME (words + node2vec)||0.3500.001||0.3940.003||0.4460.004||0.2520.007||0.1840.008||0.2080.012|
|SVD (words + TADW)||0.3550.001||0.4140.003||0.4720.004||0.2880.007||0.2220.009||0.2800.013|
|CAEME (words + TADW)||0.3500.001||0.4040.003||0.4580.004||0.2670.007||0.2120.007||0.2470.011|
|AAEME (words + TADW)||0.3670.001||0.4180.002||0.4800.004||0.2830.007||0.2270.007||0.2600.012|
|SVD (words + GCN)||0.3230.001||0.3850.003||0.4430.004||0.2600.005||0.2090.009||0.2490.011|
|CAEME (words + GCN)||0.3310.001||0.3950.003||0.4570.004||0.2510.006||0.2070.009||0.2350.012|
|AAEME (words + GCN)||0.3310.001||0.3920.003||0.4560.004||0.2430.006||0.2000.008||0.2280.012|
|SVD (words + GraphSAGE)||0.3380.001||0.4010.003||0.4640.004||0.2390.006||0.1940.009||0.2210.011|
|CAEME (words + GraphSAGE)||0.3230.001||0.3820.003||0.4350.004||0.2000.006||0.1700.007||0.2020.01|
|AAEME (words + GraphSAGE)||0.3430.001||0.4060.003||0.4680.004||0.2380.007||0.1780.008||0.2090.011|
|WBSR (Top-1 RUSSE’2020 for nouns)||0.3330.002||0.3930.003||0.4360.003||0.2520.006||0.2060.011||0.2520.013|
|hypo2path rev [cho-etal-2020-leveraging]||0.2640.001||0.2830.003||0.2380.007||0.1730.005||0.1040.008||0.1180.009|
|DWRank (Word Embeddings)|
|DWRank (Meta-embeddings based on Word Embeddings)|
|CAEME triplet loss (words)||0.4490.001||0.5810.005||0.3740.003||0.4270.010|
|AAEME triplet loss (words)||0.4740.001||0.5930.006||0.3990.004||0.4490.010|
|DWRank (Graph embeddings)|
|Poincaré [nickel2017poincare] (top-5 fastText associates)||0.3360.001||0.4760.005||0.2440.004||0.3390.009|
|node2vec [node2vec-kdd2016] (top-5 fastText associates)||0.3430.002||0.4770.005||0.2260.003||0.3220.010|
|DWRank (Meta-embeddings based on Word and Graph Embeddings)|
|SVD (words + node2vec)||0.3670.001||0.5210.005||0.2520.003||0.3510.010|
|CAEME (words + node2vec)||0.3700.001||0.5330.005||0.2670.003||0.3620.010|
|AAEME (words + node2vec)||0.3730.001||0.5290.005||0.2720.003||0.3580.010|
|SVD (words + TADW)||0.4690.001||0.6040.006||0.3940.005||0.4550.010|
|CAEME (words + TADW)||0.4290.001||0.5710.005||0.3490.003||0.4370.009|
|AAEME (words + TADW)||0.4610.001||0.5840.005||0.3620.004||0.4390.009|
|SVD (words + GCN)||0.3950.001||0.5540.005||0.2910.004||0.3560.009|
|CAEME (words + GCN)||0.3890.001||0.5440.005||0.3020.003||0.3810.008|
|AAEME (words + GCN)||0.3860.001||0.5450.006||0.2950.004||0.3650.008|
|SVD (words + GraphSAGE)||0.4100.001||0.6030.005||0.3360.004||0.4260.009|
|CAEME (words + GraphSAGE)||0.3210.001||0.5410.005||0.2660.004||0.3450.007|
|AAEME (words + GraphSAGE)||0.4090.001||0.5770.006||0.3230.004||0.4190.009|
|WBSR (Top-1 RUSSE’2020 for nouns)||0.3930.002||0.5520.005||0.2930.004||0.4280.010|
|WBSR, no search engine features||0.3690.002||0.4970.005||0.2670.004||0.3870.009|
|Top-1 RUSSE’2020 for verbs: [dale2020russe]||0.2880.001||0.4180.006||0.3410.004||0.4520.012|
|hypo2path rev [cho-etal-2020-leveraging]||0.2460.001||0.3420.006||0.1510.003||0.1940.008|
|hypo2path rev transformer [cho-etal-2020-leveraging]||0.2340.001||0.3310.004||0.1520.003||0.2010.008|