Semantic Relatedness for Keyword Disambiguation: Exploiting Different Embeddings

02/25/2020 · María G. Buey et al. · University of Zaragoza

Understanding the meaning of words is crucial for many tasks that involve human-machine interaction. This has been tackled by research in Word Sense Disambiguation (WSD) in the Natural Language Processing (NLP) field. Recently, WSD and many other NLP tasks have taken advantage of embedding-based representations of words, sentences, and documents. However, when it comes to WSD, most embedding models suffer from ambiguity as they do not capture the different possible meanings of words. Even when they do, the list of possible meanings for a word (the sense inventory) has to be known in advance at training time to be included in the embedding space. Unfortunately, there are situations in which such a sense inventory is not known in advance (e.g., an ontology selected at run-time), or it evolves over time and its status diverges from the one at training time. This hampers the use of embedding models for WSD. Furthermore, traditional WSD techniques do not perform well in situations in which the available linguistic information is very scarce, such as the case of keyword-based queries. In this paper, we propose an approach to keyword disambiguation which builds on the semantic relatedness between words and senses provided by an external inventory (ontology) that is not known at training time. Building on previous works, we present a semantic relatedness measure that uses word embeddings, and explore different disambiguation algorithms that also exploit both word and sentence representations. Experimental results show that this approach achieves results comparable with the state of the art when applied to WSD, without training for a particular domain.


1. Introduction

In any information system that requires user interaction, being able to understand the user is a crucial requirement, often tackled by limiting the user input (e.g., presenting predefined forms with fixed options). The more freedom the user is given, the more interpretation work the computer must do to achieve a useful interaction. In such a context, being capable of disambiguating the input words (i.e., associating each word with its proper meaning in the given context) is the starting point of any interpretation process performed by the computer.

Usually, such a disambiguation process is tackled from a Natural Language Processing (NLP) perspective (Navigli, 2009), assuming rich linguistic information, such as Part of Speech (POS) tags or dependencies between words, which is very useful for the task. However, due to the worldwide use of Web search engines, users are accustomed to keyword interfaces and still express their needs in such terms. In this scenario, although some studies point out that keyword search queries (aka Web search queries) exhibit their own language structure (Barr et al., 2008; Pinter et al., 2016; Roy et al., 2016), we still need methods to disambiguate the meanings of words that do not rely on such information, as it might not be available.

Recent advances in NLP have focused on the development of different embedding models (Bengio et al., 2003; Mikolov et al., 2013; Le and Mikolov, 2014; Mancini et al., 2017; Camacho-Collados et al., 2016; Pennington et al., 2014), a set of language modeling and feature learning techniques where elements from a vocabulary are mapped to a vector space capturing their distributional semantics (Sahlgren, 2008). While there are different methods to build word embeddings, the latest (and most successful) techniques rely on neural network architectures (Bengio et al., 2003; Mikolov et al., 2013; Le and Mikolov, 2014; Pennington et al., 2014). Their usage as the underlying input representation has boosted the performance of different NLP tasks (Socher et al., 2013a, b). However, in the context of disambiguation tasks, one of the main limitations of word embeddings is that the possible meanings of a word are combined into a single representation, i.e., a single vector in the semantic space. Such a limitation can be avoided by representing the individual meanings of words as distinct vectors in the space (e.g., sense2vec (Mancini et al., 2017)). However, there are scenarios where we do not know all the different senses at training time (e.g., open-domain scenarios where we cannot find all the possible meanings in a sense catalog), and, even if we knew them, we would need annotated data (which might be unavailable or expensive to obtain). Besides, we would need to train a model for each new scenario or each new meaning added to our catalog. Thus, we need a disambiguation method able to relate words and their senses in a flexible and general way (i.e., independently of the domain we are working in) by exploiting the available resources.

In this paper, we propose a keyword disambiguation method based on the semantic relatedness (the degree to which two objects are related by any kind of semantic relationship (Budanitsky and Hirst, 2006)) between words, taking advantage of the semantic information captured by word embeddings. Our proposal makes it possible to measure relatedness not only among plain words but also among senses of words (which, in a Semantic Web context, can be expressed as ontological terms), and it works independently of the resources used, i.e., the sense inventory whose meanings we want to map to and the word embedding model used as input.

For this purpose, we build on the work by Gracia and Mena on semantic relatedness (Gracia and Mena, 2008) and disambiguation (Gracia and Mena, 2009). These works exploited the word co-occurrence frequencies provided by existing Web search engines. We evolve and adapt them to improve their performance using different kinds of embeddings (both at word and sentence level). The main benefit of such an adaptation is two-fold: 1) we exploit the semantics captured by embeddings, which goes beyond the co-occurrence of terms, and 2) we decouple the proposal from any Web search engine, being able to use off-the-shelf models trained by third parties. This has an important side effect: our measure can be easily adapted to any domain for which we have a document corpus. Such an adaptation would require a training step, but it would be unsupervised and the only data required would be the corpus of documents itself.

To evaluate our approach, we have carried out a thorough experimentation in the context of Word Sense Disambiguation (WSD), where we have used different pre-trained word embeddings publicly available on the Web, and WordNet (https://wordnet.princeton.edu/) as sense repository. Our measure improves the performance obtained in (Gracia and Mena, 2008), and achieves state-of-the-art WSD results without the need for training on a specific sense inventory. This is especially relevant, for example, for systems based on keyword input and/or systems that have to work with dynamically selected ontologies (Bobed and Mena, 2016) or even with ontologies extracted directly from the Web (Movshovitz-Attias et al., 2015). All the experimental data and evaluation results are available online (https://bit.ly/2lqCzop).

The rest of the paper is structured as follows. Section 2 discusses related work. In Section 3, we describe our semantic relatedness measure; in Section 4, we present the disambiguation algorithm that we use; and Section 5 summarizes our experimental results. Finally, our conclusions and future work appear in Section 6.

2. Related Work

Semantic relatedness is the degree to which two objects are related by any kind of semantic relationship (Budanitsky and Hirst, 2006), and it lies at the core of many applications in NLP (such as WSD, Information Retrieval, Natural Language Understanding, or Entity Recognition). The term is often confused with semantic similarity, which measures the degree to which two objects are similar or equivalent; for example, "car" is similar to "bus", but it is also related to "road" and "driving". Semantic relatedness has received great research interest, and different types of methods have been developed: it can be statistically estimated (e.g., co-occurrence-based methods (Landauer et al., 1998)) or learned (e.g., distributional measures that estimate semantic relatedness between terms using a multidimensional space model to correlate words and textual contexts (Mohammad and Hirst, 2012)); or it can be computed using a taxonomy or a graph (e.g., ontologies) to define the distance between terms or concepts (Pirró, 2012). Indeed, most methods rely on particular lexical resources (dictionaries, thesauri, or well-structured taxonomies such as WordNet).

Regarding disambiguation, WSD methods can be classified into four conventional approaches: supervised (Vial et al., 2018), unsupervised (Correa Jr et al., 2018), semi-supervised (Yuan et al., 2016), and knowledge-based methods (Chaplot and Salakhutdinov, 2018). For example, similarly to our approach, in the SemEval 2015 All-Words Sense Disambiguation and Entity Linking task (http://alt.qcri.org/semeval2015/task13/), the majority of the approaches that performed best in WSD (LIMSI, SUDOKU, EBL-Hope, etc.) relied on the combination of unsupervised learning of semantic information from the content of a corpus (such as SemCor) and/or on lexical resources as sense inventories (such as WordNet or BabelNet) to disambiguate the senses of words in natural language sentences. However, to our knowledge, no previous works (except those of Gracia and Mena (Gracia and Mena, 2008, 2009)) have studied specific methods for the disambiguation of words in keyword-based inputs, where linguistic information is scarce.

Regarding the resources we use in our approach, word embeddings represent words in a low-dimensional continuous space and are used to capture syntactic and semantic information from massive amounts of textual content. In recent years, they have gained great popularity due to this ability, and many NLP applications have taken advantage of the potential of these distributional models. The work of Bengio et al. (Bengio et al., 2003) preceded a wide number of current language modeling techniques, and several authors have proposed their own approaches to construct word embedding vectors (Le and Mikolov, 2014; Pennington et al., 2014; Camacho-Collados et al., 2016), among which word2vec (Mikolov et al., 2013) is the most widely used.

Despite their advantages, one of the main limitations of word embeddings is that the possible meanings of a word are conflated into a single representation. Sense embeddings (e.g., sense2vec (Mancini et al., 2017)) have been proposed as a solution to this problem: individual meanings of words are represented as distinct vectors in the space. These approaches are classified into two categories according to how they model meaning and where they obtain it from (Camacho-Collados and Pilehvar, 2018): 1) unsupervised models, which learn word senses from text corpora (by inducing different senses of a word, analyzing its contextual semantics in a text corpus, and representing each sense based on the statistical knowledge derived from the corpus), and 2) knowledge-based methods, which exploit the sense inventories of lexical resources (such as WordNet, Wikipedia (https://www.wikipedia.org/), or BabelNet (https://babelnet.org/)) to represent meanings. We can also find models that provide representations not only of words but also of their senses in a joint embedded space. This is the case of the NASARI vectors (Camacho-Collados et al., 2016), which not only provide accurate representations of word senses in different languages, but also include both concepts and named entities, all in a single unified semantic space. However, in the first case (i.e., unsupervised models), we cannot target a particular sense inventory or ontology to perform the disambiguation (having no control over, e.g., the concept detail/granularity), and the detected senses might not be aligned to any particular human-readable structure; in the second case, we need to know all the senses at training time, and cannot adapt to new scenarios (e.g., addition/deletion of senses in the inventory, evolving ontologies, etc.). Thus, the sweet spot is to require neither re-training nor newly labelled data, while being able to perform the disambiguation against any sense repository.

Although sense embeddings capture and represent information about meanings and can be used to calculate the sense that a word has in a specific context, word embeddings have also shown good performance in disambiguation tasks (Iacobacci et al., 2016). Therefore, we wanted to explore how to push further the usage of word embeddings for keyword disambiguation. Working at the word level (as a starting point) allows us to use a semantic relatedness measure between terms and reuse available resources, without needing to explicitly train new word embeddings either for a specific task or for newly added senses (i.e., adapting to any given sense dictionary or ontology). We have taken as baseline the works presented in (Gracia and Mena, 2008, 2009). In (Gracia and Mena, 2009), the authors provide a keyword disambiguation algorithm that uses the semantic relatedness measure defined in (Gracia and Mena, 2008) to find the appropriate sense for each keyword. The authors focused on a method that exploits the Web as a source of knowledge, transforming the Normalized Google Distance (NGD) (Cilibrasi and Vitanyi, 2007) into a mixed relatedness measure (between ontology terms and plain words). We propose, on the one hand, to substitute this distance with one based on word embeddings, to take advantage of the semantics captured by embeddings and improve on the performance obtained using just the co-occurrence of terms; and, on the other hand, to explore modifications of their algorithm to improve its disambiguation capabilities.

Finally, as pointed out by Lastra-Díaz et al. (Lastra-Díaz et al., 2019), the embeddings that behave best for disambiguation purposes are those that capture not only the distributional semantics of texts, but also structural information about the possible meanings. We aim at achieving this disambiguation performance in a more flexible way, decoupling the linguistic surface from the actual sense catalog (i.e., ontology) in order to adapt to new (i.e., unknown at training time) possible meanings, and being able to apply it to keyword inputs, where linguistic information is scarce.

3. Relatedness Measure based on Word Embeddings

Word embeddings can be used out of the box to compute relatedness between words. However, they do not suffice in situations in which relatedness has to be computed between senses (e.g., ontology terms in a Semantic Web context) or between senses and words. To that end, we ground on the relatedness measure between senses previously proposed by Gracia and Mena (Gracia and Mena, 2008). The authors proposed a method to compute the semantic relatedness between ontology terms (which we can see as individual senses), and an extension to calculate it between plain words and terms. Their proposal was built on the notion of the ontological context of a term, which is constructed by combining the synonyms and the hypernyms of the ontological term (or sense). Given an ontological term $T$, they defined its ontological context (denoted by $OC(T)$) as the minimum set of other ontological terms that belong to its semantic description, locating the term in the ontology and characterizing its meaning. For example, in the WordNet taxonomy, the class "Java" (in the sense of "an Indonesian island") is well characterized and distinguished from other senses by considering its direct hypernym "Island" (see Figure 1).

Figure 1. Example of the semantic description of the term ”Java” in WordNet.

Then, given two ontological terms $a$ and $b$, their relatedness measure is computed as:

$$rel(a,b) = w_0 \cdot rel_0(a,b) + w_1 \cdot rel_1(a,b) \qquad (1)$$

with $rel_0$ and $rel_1$ computed as follows:

$$rel_0(a,b) = \frac{1}{|S_a||S_b|} \sum_{x \in S_a} \sum_{y \in S_b} rel_{word}(x,y) \qquad (2)$$

$$rel_1(a,b) = \frac{1}{|OC_a||OC_b|} \sum_{x \in OC_a} \sum_{y \in OC_b} rel_{word}(x,y) \qquad (3)$$

where $rel_{word}$ refers to the relatedness between words (as defined later in Equations 7 and 9); $S_a$ and $S_b$ are the sets of synonyms (equivalent labels, including the term label) of the ontological terms $a$ and $b$; and $OC_a$ and $OC_b$ are the terms of their ontological contexts (notice that the label of $a$ belongs to $S_a$, and that of $b$ to $S_b$). Each of $a$ and $b$ is thus characterized by taking into account two levels of its semantic description: Level 0) the term label and its synonyms (Equation 2), and Level 1) its ontological context (Equation 3). $w_0$ and $w_1$ are used to weight these levels (we set their values as indicated in (Gracia and Mena, 2008)).

This measure can also be applied between an ontology term $a$ and a plain word $w$, which provides a value indicating the relatedness degree between a sense and a word. In that case, the previous equations are computed as follows:

$$rel(a,w) = w_0 \cdot rel_0(a,w) + w_1 \cdot rel_1(a,w) \qquad (4)$$

$$rel_0(a,w) = \frac{1}{|S_a|} \sum_{x \in S_a} rel_{word}(x,w) \qquad (5)$$

$$rel_1(a,w) = \frac{1}{|OC_a|} \sum_{x \in OC_a} rel_{word}(x,w) \qquad (6)$$

Here, $rel_{word}(x,y)$ is the measure that the authors used in (Gracia and Mena, 2008) to quantify how two plain words are related. They proposed a generalization of Cilibrasi and Vitányi's Normalized Google Distance NGD(x,y) (Cilibrasi and Vitanyi, 2007) to use any Web search engine as a source of frequencies. This generalization is called the Normalized Web Distance NWD(x,y), whose smaller values represent a greater semantic relation between words. Although most NWD values fall between 0 and 1, it ranges from 0 to $\infty$. Therefore, to obtain a proper relatedness measure in the range [0, 1] that increases inversely to the distance, they proposed the following transformation:

$$rel_{Web}(x,y) = e^{-2 \cdot NWD(x,y)} \qquad (7)$$

To explore the use of emerging word-embedding techniques in this context, and to compare them with those based on search engines, we propose to exploit the semantic capabilities of word embeddings in this formulation by substituting the $rel_{Web}$ measure. A first option would be the cosine similarity between the embedding vectors of the words:

$$sim_{cos}(x,y) = \cos(\theta_{xy}) = \frac{\vec{x} \cdot \vec{y}}{\|\vec{x}\| \, \|\vec{y}\|} \qquad (8)$$

where $x$ and $y$ are plain words, $\vec{x}$ and $\vec{y}$ their corresponding word embedding vectors, and $\theta_{xy}$ the angle between them. However, $sim_{cos}$ ranges in [-1, 1], so, in order to obtain a measure in the range [0, 1] (so that we can substitute Equation 7 directly in Equation 2), we propose to use the angular distance instead, obtaining the relatedness as follows:

$$rel_{ang}(x,y) = 1 - \frac{\arccos(sim_{cos}(x,y))}{\pi} \qquad (9)$$

So, in Equation 2, we use Equation 9 as the word relatedness instead of Equation 7. We use this measure to compute the semantic relatedness between words, between ontology terms (or senses), or between ontology terms and words, obtaining a value between 0 and 1. For those cases in which the label of the ontological term is multi-word, we simply compute the centroid of the set of word vectors that form the label. While, at first sight, it might seem that we limit the coverage of the measure proposed in (Gracia and Mena, 2008) (which built on the results of Web search engines, potentially covering any domain), we have to bear in mind the plethora of word embedding models directly available on the Web, as well as the possibility of using our own corpus of documents to fine-tune the measure for a particular domain (such a corpus is much easier to obtain than crawling the whole WWW).
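To make the computation concrete, the following Python sketch implements the angular relatedness (Equation 9) and the term-to-word measure (Equations 4-6), with multi-word labels represented by the centroid of their word vectors. It is a minimal sketch, not the actual implementation: the `emb` dictionary, the helper names, and the equal level weights `w0 = w1 = 0.5` are illustrative assumptions (the paper takes the weights from (Gracia and Mena, 2008)).

```python
import numpy as np

def rel_ang(v_x, v_y):
    """Angular relatedness between two word vectors (Equation 9)."""
    cos = np.dot(v_x, v_y) / (np.linalg.norm(v_x) * np.linalg.norm(v_y))
    cos = np.clip(cos, -1.0, 1.0)        # guard against rounding errors
    return 1.0 - np.arccos(cos) / np.pi  # maps [-1, 1] into [0, 1]

def centroid(vectors):
    """Centroid of a set of word vectors (used for multi-word labels)."""
    return np.mean(vectors, axis=0)

def rel_term_word(syns, ctx, v_word, emb, w0=0.5, w1=0.5):
    """Relatedness between an ontological term and a plain word (Eqs. 4-6).
    syns: synonym labels of the term (Level 0); ctx: labels of its
    ontological context (Level 1); emb: mapping from word to vector.
    w0 and w1 are placeholder weights, not the paper's values."""
    def label_vec(label):
        # a multi-word label is represented by the centroid of its words
        return centroid([emb[w] for w in label.split() if w in emb])
    rel0 = np.mean([rel_ang(label_vec(s), v_word) for s in syns])  # Eq. 5
    rel1 = np.mean([rel_ang(label_vec(c), v_word) for c in ctx])   # Eq. 6
    return w0 * rel0 + w1 * rel1                                   # Eq. 4
```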

4. Disambiguation Algorithm

We ground our keyword disambiguation proposal on the disambiguation algorithm defined in (Gracia and Mena, 2009), using the adapted semantic relatedness measure proposed in the previous section. This algorithm is based on the hypothesis that the most significant words in the disambiguation context are those most highly related to the word to disambiguate; such words constitute the active context of the word being disambiguated.

As an overview, once the active context of each input keyword has been calculated, the algorithm performs three main steps: 1) obtaining the semantic relatedness between the active context of a keyword and its possible senses, 2) calculating the overlap between the words in the active context and the semantic descriptions (i.e., ontological contexts) of the possible senses of the keyword to disambiguate, and 3) re-ranking the possible senses according to their frequency of use (only when such information is available for the selected sense inventory; if it is not, we assume that all senses are equally likely). Apart from using the updated measure to select the active contexts, we propose to modify the second step of this algorithm in order to study the influence of different approaches that exploit the semantic information captured by different word embeddings. In the following subsections, we first detail the original algorithm on which we base our proposal, and then we describe the modifications that we propose to improve its performance using word embeddings.

4.1. Background: Algorithm Description

First of all, let us formally introduce the notion of active context. Let $k_i$ be an element of an input sequence of words $K = \{k_1, \ldots, k_n\}$ with an intended meaning, $C \subseteq K$ the set of keywords of the disambiguation context (i.e., the complete disambiguation window considered, e.g., the sentence where the keyword appears), and $k \in C$ the target keyword to disambiguate. Thus:

Definition 4.1. Given a context $C$ and a word to disambiguate $k \in C$, the active context of $k$ is a subset $AC(k) \subseteq C$ whose elements are the context words most semantically related to $k$.

In other words, $AC(k)$ contains the words in the input that are the most related to $k$. To obtain such a context, we stick to the method proposed in (Gracia and Mena, 2009): 1) removing repeated words and stopwords from $C$, 2) applying a semantic relatedness measure ($rel_{ang}$ in our case) between each context word and the keyword $k$ to disambiguate, and 3) constructing $AC(k)$ with the context words whose relatedness score is above a certain threshold. The output of this process is the active context $AC(k)$, whose maximum cardinality is set to a fixed value following Kaplan's experiments (Kaplan, 1955).

Once we have obtained $AC(k)$ for $k$, we can apply the main algorithm, which takes as input $k$, $AC(k)$, and the set $S_k$ of possible senses for $k$. The main steps are presented in Algorithm 1 (we refer the interested reader to (Gracia and Mena, 2009) for the complete details):

Input:
    k: the keyword to disambiguate.
    S_k: the set of possible senses for k.
    AC(k): the active context selected for k.
Output: a weight w_s for each sense s ∈ S_k.

function disambiguate(k, S_k, AC(k)):
    // Step 1: semantic relatedness against the active context
    foreach sense s ∈ S_k do
        foreach keyword c ∈ AC(k) do
            compute rel(s, c)
        end foreach
        w_s ← mean of rel(s, c) over all c ∈ AC(k)
    end foreach
    // Step 2: context overlap
    foreach sense s ∈ S_k do
        w_s ← w_s weighted by overlap(AC(k), OC(s))
    end foreach
    // Step 3: frequency of use (when available)
    foreach sense s ∈ S_k do
        if w_s is above the closeness threshold then
            w_s ← combination of w_s and the relative frequency of use of s
        end if
    end foreach
Algorithm 1. Keyword disambiguation algorithm.
  1. Applying the semantic relatedness: First, the algorithm computes an initial disambiguation score for the senses in $S_k$ against the active context (Step 1 of Algorithm 1). For this, we use the updated relatedness measure presented in the previous section (Equations 4 and 9). The score $w_s$ assigned to each sense $s$ is the mean of $rel(s, c)$, where $s$ is a candidate sense of the keyword being disambiguated and $c$ is a keyword in the active context.

  2. Calculating the context overlap: The disambiguation algorithm then weights the scores taking into account the overlap between $AC(k)$ and the ontological context of each sense, $OC(s)$ (Step 2 of Algorithm 1). Note that $OC(s)$ includes the synonyms, glosses, and labels of the sense, as well as the labels of other related terms, such as hypernyms, hyponyms, meronyms, holonyms, etc. The overlap is calculated (ignoring stopwords) as the fraction of active context words that appear in the semantic description of the sense:

$$overlap(AC(k), s) = \frac{|AC(k) \cap words(OC(s))|}{|AC(k)|}$$

  3. Frequency of usage: Finally, the frequency of use of the highest-scored senses is taken into account (Step 3 of Algorithm 1), if such information is available. The proximity decision is handled by a closeness factor, which is combined with the maximum of the sense scores ($w_{max}$) to obtain a threshold. The scores of the senses above such a threshold are then updated using:

$$w_s \leftarrow a \cdot w_s + b \cdot \frac{freq(s)}{freqTotal(k)}$$

    where $freqTotal(k)$ is the sum of the frequencies of all senses of $k$, and $a$ and $b$ are weighting constants (set as indicated in (Gracia and Mena, 2009)).

The output of the disambiguation algorithm is a score $w_s$ for each possible sense $s \in S_k$ that represents the confidence of it being the right sense according to the active context $AC(k)$. Note that, in our approach, $S_k$ is not restricted to any particular dictionary, as it could be dynamically built from, e.g., different ontological resources.
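Putting the three steps together, the following Python sketch mirrors Algorithm 1 under the reconstructions above. It is a sketch, not the reference implementation: the multiplicative overlap weighting and the `closeness`, `a`, and `b` values are assumptions of this sketch (the actual values come from (Gracia and Mena, 2009)), and `rel` denotes the sense-to-word relatedness of Equations 4-6.

```python
from statistics import mean
from typing import Callable, Dict, List, Optional, Set

def disambiguate(k: str, senses: List[str], ac: List[str],
                 rel: Callable[[str, str], float],
                 oc_words: Dict[str, Set[str]],
                 freq: Optional[Dict[str, float]] = None,
                 closeness: float = 0.9, a: float = 0.7,
                 b: float = 0.3) -> Dict[str, float]:
    """Sketch of Algorithm 1: score each sense of keyword k against AC(k).
    oc_words maps each sense to the bag of words of its semantic description;
    freq holds optional sense frequencies."""
    # Step 1: mean relatedness between each sense and the active context
    w = {s: mean(rel(s, c) for c in ac) for s in senses}
    # Step 2: weight each score by the overlap between AC(k) and OC(s)
    for s in senses:
        overlap = len(set(ac) & oc_words[s]) / len(ac)
        w[s] *= 1.0 + overlap
    # Step 3: re-rank near-top senses by frequency of use, when available
    if freq:
        total = sum(freq.get(s, 0.0) for s in senses) or 1.0
        thr = closeness * max(w.values())
        for s in senses:
            if w[s] >= thr:
                w[s] = a * w[s] + b * freq.get(s, 0.0) / total
    return w
```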

4.2. Proposed Modifications

As our aim is to study the best way to exploit word embeddings, we have analyzed their characteristics and explored different approaches to use them in the adopted disambiguation process. In particular, in this section, we present a list of possible modifications to Step 2 of the algorithm to include and take advantage of the properties of word embeddings, along with the rationale behind them. For the rest of the section, let $w_{max}$ be the maximum score among all senses in $S_k$, $\mu(\cdot)$ a function that computes the arithmetic mean (centroid) of a set of vectors, and $d_{ang}(x,y) = 1 - rel_{ang}(x,y)$ the angular distance derived from Equation 9. The different approaches are described below:

  • Average: The straightforward way to include the embeddings is to directly calculate the average vector of the different bags of words involved in the disambiguation, under the assumption that semantically coherent groups of words should stand out from the others. Thus, instead of computing the overlap between the semantic description of each sense and the current active context $AC(k)$, we propose to compute the distance between the averages of the word vectors from $OC(s)$ and from $AC(k)$ to obtain the new score. Step 2 of Algorithm 1 changes to:

    $$score_{avg}(s) = 1 - d_{ang}(\mu(AC(k)), \mu(OC(s)))$$

    where $\mu(X)$ is:

    $$\mu(X) = \frac{1}{|X|} \sum_{x \in X} \vec{x}$$

    That is, we consider each set of words as a cluster in the vector space, and we represent it by its centroid. If there are elements that do not contribute to the semantic cohesion of the clusters, they will contribute negatively (they will increase the semantic distance) to the selection of a particular sense for the target keyword (we also studied other cluster-based distance measures, e.g., single linkage, but the results did not improve on those of the centroid-based measure, so we focused on the average vector, which is broadly used in the literature).

  • Sense centroid without most frequent component: As an evolution of the previous method, we studied the method described by Arora et al. (Arora et al., 2017), called Smooth Inverse Frequency (SIF). They propose to represent a sentence by a weighted average of its word vectors, from which the most frequent component (computed via PCA/SVD) is subtracted. Thus, we propose to consider the semantic descriptions of all senses of the sense inventory as sentences and to calculate their SIF embeddings. Then, during the disambiguation, we compute a new score (Step 2) for the sense being considered by measuring the distance between the centroid of the active context and the SIF vector of each $OC(s)$:

    $$score_{SIF}(s) = 1 - d_{ang}(\mu(AC(k)), SIF(OC(s)))$$

    Note that we do not subtract the most frequent component from the active context vector, as all its words are already deemed important. The most frequent component vector we remove may encompass those words that occur most frequently in a corpus and lack semantic content (e.g., stopwords), thus not contributing to the actual disambiguation.

  • Top-K nearest words: As a variant of the two previous methods, here we select the top K words from the semantic description of a sense that are nearest to the target keyword and its active context. After that, we compute the distance between the centroid of the active context and the centroid of the selected top-K nearest words to obtain the new score (see the sketch after this list):

    $$score_{topK}(s) = 1 - d_{ang}(\mu(AC(k)), \mu(topK(OC(s))))$$

    In this case, we work under the same hypothesis as for the selection of the active context: the words of the semantic description of the sense that are closest to the active context and to the keyword being disambiguated should be the most significant ones to contribute to a correct disambiguation.

  • Doc2vec: Finally, instead of treating the ontological descriptions as bags of words, in this method we consider them as proper documents and apply doc2vec (Le and Mikolov, 2014). In particular, the semantic description of each sense becomes a document, and doc2vec allows us to calculate an embedding space for all of them. Then, we compute the distance between the centroid of the active context and the embedding calculated for the sense. Note that doc2vec also learns a word embedding model during training; we use those word vectors to create the centroid of the active context. Therefore, in a similar way, the new score is computed as:

    $$score_{doc2vec}(s) = 1 - d_{ang}(\mu(AC(k)), \vec{d}_{OC(s)})$$

    We consider the semantic descriptions as documents in order to capture the distributional semantics both at the local (window) and the global (document) scope.
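As the Top-K variant is the one we finally report in Section 5, a minimal sketch of its score for a single sense follows; `k=5` is a placeholder (not the K selected in our experiments), and `rel_ang` is the function from the sketch in Section 3.

```python
import numpy as np

def top_k_score(ac_vecs, v_target, desc_vecs, k=5):
    """Top-K nearest words score for one sense. ac_vecs: word vectors of the
    active context (2-D array); v_target: vector of the keyword being
    disambiguated; desc_vecs: word vectors of the sense's description."""
    # rank the description words against the keyword and its active context
    query = np.mean(np.vstack([ac_vecs, v_target[None, :]]), axis=0)
    ranked = sorted(desc_vecs, key=lambda v: rel_ang(v, query), reverse=True)
    top_centroid = np.mean(ranked[:k], axis=0)
    # higher angular relatedness (lower distance) means a higher score
    return rel_ang(np.mean(ac_vecs, axis=0), top_centroid)
```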

We report the best results that we obtained by applying these different approaches in the following section.

5. Experimental Evaluation

In this section, we discuss the results of the experiments that we have carried out to evaluate our proposal. Firstly, we evaluated different available embedding models using the measure proposed in Equation 9, performing several tests against human judgment to check how the angular distance behaved. Secondly, we evaluated the potential of our keyword disambiguation algorithm and of the relatedness measure among ontology terms and words in the context of WSD, including all the algorithm variations proposed in Section 4.2.

For the experiments, we used the following pre-trained vectors: word2vec trained on the Google News corpus (https://code.google.com/archive/p/word2vec/), word2vec trained on Wikipedia (https://github.com/jhlau/doc2vec), doc2vec also trained on Wikipedia (dump dated 2015-12-01), GloVe trained on the Wikipedia 2014 and Gigaword 5 corpora (https://nlp.stanford.edu/projects/glove/), and the word2vec word embeddings trained on the UMBC corpus (http://lcl.uniroma1.it/nasari/#two). We used WordNet as the sense inventory.
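As an illustration of how such pre-trained vectors can be loaded and queried, a sketch follows; it assumes the gensim library and the file name under which the Google News vectors are distributed (the local path is an assumption of this sketch), and reuses `rel_ang` from the sketch in Section 3.

```python
from gensim.models import KeyedVectors

# Load the pre-trained Google News word2vec vectors.
w2v = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

# Angular relatedness (Equation 9) between two plain words.
print(rel_ang(w2v["car"], w2v["road"]))
```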

5.1. Correlation with Human Judgment

In order to validate the suitability of using word embeddings together with the angular distance to compute semantic relatedness, we first analysed the correlation of this technique with human judgment in a basic word-to-word comparison. For this purpose, we used different datasets available in the literature that contain pairs of words whose relatedness was manually assessed by different people. The datasets and their details are shown in Table 1.

Dataset #Word Pairs #Human Judges
MC-30 (Miller and Charles, 1991) 30 38
WordSim353-Rel (Finkelstein et al., 2001) 252 13
WordSim353-Sim (Finkelstein et al., 2001) 203 16
RG-65 (Rubenstein and Goodenough, 1965) 65 51
MEN dataset (train/dev) (Bruni et al., 2014) 2000/1000 crowdsourced (via Amazon Mechanical Turk: https://www.mturk.com/)
GM dataset (Gracia and Mena, 2008) 30 30
Table 1. Correlation with human judgment benchmarks.

The results obtained for the Spearman correlation are presented in Table 2. Reported values, where available, were calculated using the widely used cosine similarity. We can see that using the angular distance (Equation 9) to calculate relatedness between pairs of words also correlates well with human judgment. In particular, regarding the GM dataset (Gracia and Mena, 2008), the authors reported 78% using the previous relWeb measure (Equation 7). We can see a strong improvement on this dataset by using word embeddings: we achieve up to 87.3% using word2vec trained on Google News (and an average of 81.2% over all models). These results enable us to use the angular distance as the core relatedness measure in Equations 1 to 3. Note that for word2vec and doc2vec trained on Wikipedia we cannot provide a comparison, because Lau & Baldwin (Lau and Baldwin, 2016) did not evaluate the correlation with human judgment.

Vectors\Datasets MC-30 WS353-Sim WS353-Rel RG-65 MEN GM Average
GloVe 70.4 66.5 56.1 76.9 74.2 84.5 71.4
Reported at Pennington et al. (Pennington et al., 2014) 72.7 65.8 - 77.8 - - 72.1
Google News word2vec 80.0 77.2 63.5 76.0 77.0 87.3 76.8
Reported at Camacho-Collados et al. (Camacho-Collados et al., 2016) 80.0 77.0 - - - - 78.5
UMBC word2vec 70.3 72.7 56.8 70.7 74.5 74.7 70.0
Reported at Camacho-Collados et al. (Camacho-Collados et al., 2016) 83.0 68.0 - 80.0 - - 75.5
Wikipedia word2vec 80.9 77.9 62.2 78.3 76.9 81.8 76.3
Reported at Lau & Baldwin (Lau and Baldwin, 2016) - - - - - - -
Wikipedia doc2vec 73.3 69.0 52.3 71.6 72.0 77.8 69.3
Reported at Lau & Baldwin (Lau and Baldwin, 2016) - - - - - - -
Table 2. Spearman correlation coefficients between the angular distance applied on word pairs and human judgment in different datasets. Upper rows are our evaluations; lower rows are the values reported in the original papers using the cosine distance. Several values equal or outperform the best result (78%) reported for the GM dataset in (Gracia and Mena, 2008).
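The evaluation protocol itself is straightforward; a sketch follows, assuming SciPy and reusing `rel_ang` and a loaded embedding model from the earlier sketches (the example call uses toy data, not one of the benchmark datasets).

```python
from scipy.stats import spearmanr

def evaluate_correlation(pairs, human_scores, emb):
    """Spearman correlation between Eq. 9 relatedness and human judgments.
    pairs: list of (word1, word2) tuples; human_scores: gold ratings
    aligned with pairs; emb: a mapping from word to vector (e.g., w2v)."""
    predicted = [rel_ang(emb[w1], emb[w2]) for w1, w2 in pairs]
    rho, _ = spearmanr(predicted, human_scores)
    return rho

# Illustrative call:
# evaluate_correlation([("car", "automobile"), ("car", "road")], [3.9, 3.0], w2v)
```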

5.2. Word Sense Disambiguation Evaluation

To evaluate our proposal, we used three datasets oriented to WSD: the SemCor 2.0 dataset (http://web.eecs.umich.edu/~mihalcea/downloads.html#semcor), the SemEval 2013 all-words WSD dataset (https://www.cs.york.ac.uk/semeval-2013/task12.html), and the SemEval 2015 All-Words Sense Disambiguation and Entity Linking dataset (http://alt.qcri.org/semeval2015/task13/). We used WordNet as sense inventory. For SemCor 2.0, we specifically used WordNet 2.0, as this dataset is annotated with that version; for the rest of the datasets, we used WordNet 3.0.

We tested all the options proposed in Section 4.2 for the disambiguation algorithm, and the Top-K nearest words option achieved the best results (achieving its best performance for a particular value of K). Thus, due to space restrictions, we focus on the Top-K nearest words option in this section (the interested reader can find all the details of the experiments at https://bit.ly/2lqCzop). Regarding the models, we selected word2vec trained on Google News and word2vec trained on Wikipedia because they showed the best average correlation with human judgment across datasets (see Table 2), and the UMBC word2vec embeddings because, although they did not excel in correlation with human judgment, they showed the best performance on all WSD test datasets. Finally, in order to compare our results with (Gracia and Mena, 2008), we report precision for SemCor 2.0, and F-score for the rest of the datasets.

SemCor2.0 Experiments:

Following (Gracia and Mena, 2008), in this set of experiments, for each of three selected highly ambiguous nouns (plant, glass, and earth), we took 10 random sentences from the corpus. Table 3 presents the results: all cases outperform the results achieved in (Gracia and Mena, 2008), which reported an average precision of 57%. Our best performance is an average precision of 63.15% with the UMBC word2vec vectors. In fact, the SIF method shows equal or even slightly better performance on this particular dataset using the UMBC word2vec and Google News word2vec vectors. However, in the rest of the cases it is the Top-K nearest words method that obtains the best results. In addition, the SIF method requires preprocessing the target sense inventory to calculate the sentence embeddings, introducing a mild dependence on it. Our selected method shows good performance (it improves the results of the original algorithm) while remaining more decoupled from the actual sense inventory used.

Experiment\Approach Wikipedia word2vec Google News word2vec UMBC word2vec relWeb* Most Freq. Sense*
10 sent. with PLANT 58.44% 63.03% 66.20% 80% 40%
10 sent. with GLASS 57.47% 63.78% 60.15% 30% 30%
10 sent. with EARTH 59.21% 56.38% 62.33% 60% 60%
AVERAGE 58.41% 61.13% 63.15% 57% 43%
Table 3. Precision results for the SemCor 2.0 dataset (10 random sentences) adopting Top-K nearest words. The two rightmost columns (*) show the results of the relWeb-based relatedness measure and of the Most Frequent Sense method, as reported in (Gracia and Mena, 2008).

SemEval Results:

In Table 4, we present the results obtained for SemEval 2013 and SemEval 2015. In this case, the UMBC word2vec vectors achieved the best results, with an F-score of 64.39%. In SemEval 2013, UMCC-DLSI reported the best results, with an F-score of 64.7%, similar to ours. Besides, our results are similar to those of other state-of-the-art systems using sense embeddings: Camacho-Collados et al. (Camacho-Collados et al., 2016) reported an F-score of 66.7% with their vectors evaluated on SemEval 2013 (unfortunately, they do not provide results for their vectors using WordNet). Regarding SemEval 2015, the best reported result in our task reached an F-score of 65.8%, while the baseline used to compare systems (BabelNet First Sense, BFS) reached an F-score of 67.5%. We reach an F-score of 61.61%, which, while it does not beat the previous values, is noteworthy given that our approach is focused on situations where linguistic information might be scarce (e.g., keyword-based input).

Dataset\Approach Wikipedia word2vec Google News word2vec UMBC word2vec Best system Baseline
SemEval 2013 59.59% 62.81% 64.39% 66.7% 63.0%
SemEval 2015 61.32% 60.37% 61.61% 65.8% 67.5%
Table 4. F-score results for the SemEval 2013 and 2015 datasets adopting Top-K nearest words. The Best system column corresponds to the best system in each SemEval dataset; the Baseline column shows the baselines reported in SemEval 2013 and SemEval 2015.

To sum up, our proposal improves the results presented in (Gracia and Mena, 2008) by substituting their Web search engine-based measure with one that uses word embeddings. We also improve the disambiguation results reported in (Gracia and Mena, 2009) by adapting their algorithm to exploit the properties of word embeddings. Our proposal achieves performance levels similar to the state of the art, while providing the flexibility to work independently of the resources used (i.e., word embeddings, sense inventory) and reducing the barriers to its application to any domain.

6. Conclusions and future work

In this paper, we have presented a keyword disambiguation approach based on a semantic relatedness measure that exploits the semantic information captured by word embeddings and maps words to meanings from a sense inventory. We have revisited the semantic relatedness measure proposed in (Gracia and Mena, 2008) to adapt it to work with word embeddings instead of relying on Web search engines, and we have improved the disambiguation algorithm of (Gracia and Mena, 2009) by exploring different uses and types of embeddings (both at word and sentence level).

To validate our proposal, we have performed several experiments around Word Sense Disambiguation (WSD) where we have used different pre-trained word embeddings and WordNet as the resource to obtain the target senses of words. With our proposal:

  • We are able to relate words and meanings from a sense inventory (e.g., ontology terms) in a flexible way, by exploiting available resources and regardless of the domain in which we are working. This makes our measure adaptive and general enough to be used in different contexts.

  • We provide a method that can be adapted to any domain in a dictionary-decoupled way, provided that we have a document corpus from which to capture the distributional semantics. This lowers the data requirements for building more specific models for particular domains.

  • We have tested the capabilities of different word embedding models, improving the results presented in (Gracia and Mena, 2008). We evaluated our measure on the same SemCor 2.0 dataset described in that work, obtaining in the best case an average increase of 6 points in precision (a relative improvement of about 11%).

  • Being decoupled from a fixed pool of senses does not come at the expense of performance: we achieve a quality of results similar to that of an ad hoc, more expensively trained model capturing the possible senses. In particular, we have tested our measure on the SemEval 2013 and SemEval 2015 datasets, reaching F-scores of 64.39% and 61.61%, respectively. These results are similar to the state of the art (Camacho-Collados et al., 2016) using sense embedding approaches.

As future work, we want to extend our approach to the field of concept discovery (similar to entity search (Balog, 2018), but focused on concepts rather than on instances). We also want to explore newer contextualized word embeddings, such as ELMo (Peters et al., 2018), BERT, or XLNet, and how they could be used in this context. Finally, we would like to build a dataset specifically for keyword disambiguation, taking the QALD datasets as a starting point (QALD is a series of evaluation campaigns on Question Answering over Linked Data: http://qald.aksw.org); we want to develop it in order to test our relatedness measure in a setting more appropriate for the context on which we focus: the disambiguation of keyword-based inputs.

References

  • S. Arora, Y. Liang, and T. Ma (2017) A simple but tough-to-beat baseline for sentence embeddings. In Proc. of Intl. Conf. on Learning Representations (ICLR’17), pp. 1–16. Cited by: 2nd item.
  • K. Balog (2018) Encyclopedia of database systems. pp. 1326–1331. Cited by: §6.
  • C. Barr, R. Jones, and M. Regelson (2008) The linguistic structure of english web-search queries. In Proc. of Conf. on Empirical Methods in Natural Language Processing (EMNLP’08), pp. 1021–1030. Cited by: §1.
  • Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin (2003) A neural probabilistic language model. Journal of Machine Learning Research 3 (6), pp. 1137–1155. Cited by: §1, §2.
  • C. Bobed and E. Mena (2016) QueryGen: semantic interpretation of keyword queries over heterogeneous information systems. Information Sciences 329, pp. 412–433. Cited by: §1.
  • E. Bruni, N. K. Tran, and M. Baroni (2014) Multimodal distributional semantics. Journal of Artificial Intelligence Research 49, pp. 1–47. Cited by: Table 1.
  • A. Budanitsky and G. Hirst (2006) Evaluating wordnet-based measures of semantic distance. Computational Linguistics 32 (1), pp. 13–47. Cited by: §1, §2.
  • J. Camacho-Collados, M. T. Pilehvar, and R. Navigli (2016) NASARI: integrating explicit knowledge and corpus statistics for a multilingual representation of concepts and entities. Artificial Intelligence 240, pp. 36–64. Cited by: §1, §2, §2, §5.2, Table 2, 4th item.
  • J. Camacho-Collados and M. T. Pilehvar (2018) From word to sense embeddings: a survey on vector representations of meaning. Journal of Artificial Intelligence Research 63 (1), pp. 743–788. Cited by: §2.
  • D. S. Chaplot and R. Salakhutdinov (2018) Knowledge-based word sense disambiguation using topic models. In Proc. of AAAI Conf. on Artificial Intelligence (AAAI’18), pp. 5062–5069. Cited by: §2.
  • R. L. Cilibrasi and P. M. B. Vitanyi (2007) The google similarity distance. IEEE Transactions on Knowledge and Data Engineering 19 (3), pp. 370–383. Cited by: §2, §3.
  • E. A. Correa Jr, A. A. Lopes, and D. R. Amancio (2018) Word sense disambiguation: a complex network approach. Information Sciences 442, pp. 103–113. Cited by: §2.
  • L. Finkelstein, E. Gabrilovich, Y. Matias, E. Rivlin, Z. Solan, G. Wolfman, and E. Ruppin (2001) Placing search in context: the concept revisited. In Proc. of Intl. Conf. on World Wide Web (WWW’01), pp. 406–414. Cited by: Table 1.
  • J. Gracia and E. Mena (2008) Web-based measure of semantic relatedness. In Proc. of Intl. Conf. on Web Information Systems Engineering (WISE’08), pp. 136–150. Cited by: §1, §1, §2, §2, §3, §3, §3, §5.1, §5.2, §5.2, §5.2, Table 1, Table 2, Table 3, 3rd item, §6, footnote 7.
  • J. Gracia and E. Mena (2009) Multiontology semantic disambiguation in unstructured web contexts. In Proc. of Workshop on Collective Knowledge Capturing and Representation (CKCaR’09) at K-CAP’09, pp. 1–9. Cited by: §1, §2, §2, §4.1, §4, §5.2, §6, footnote 10, footnote 9.
  • I. Iacobacci, M. T. Pilehvar, and R. Navigli (2016) Embeddings for word sense disambiguation: an evaluation study. In Proc. of Annual Meeting of the Association for Computational Linguistics (ACL’16), pp. 897–907. Cited by: §2.
  • A. Kaplan (1955) An experiment study of ambiguity and context. Mechanical Translation 2 (1), pp. 39–46. Cited by: §4.1.
  • T. K. Landauer, P. W. Foltz, and D. Laham (1998) An introduction to latent semantic analysis. Discourse Processes 25 (2-3), pp. 259–284. Cited by: §2.
  • J. J. Lastra-Díaz, J. Goikoetxea, M. A. H. Taieb, A. García-Serrano, M. B. Aouicha, and E. Agirre (2019) A reproducible survey on word embeddings and ontology-based methods for word similarity: linear combinations outperform the state of the art. Engineering Applications of Artificial Intelligence 85, pp. 645–665. Cited by: §2.
  • J. H. Lau and T. Baldwin (2016) An empirical evaluation of doc2vec with practical insights into document embedding generation. In Proc. of Workshop on Representation Learning for NLP (Rep4NLP’16), pp. 78–86. Cited by: §5.1, Table 2.
  • Q. Le and T. Mikolov (2014) Distributed representations of sentences and documents. In Proc. of Intl. Conf. on Machine Learning (ICML’14), pp. 1188–1196. Cited by: §1, §2, 4th item.
  • M. Mancini, J. Camacho-Collados, I. Iacobacci, and R. Navigli (2017) Embedding words and senses together via joint knowledge-enhanced training. In Proc. of Conf. on Computational Natural Language Learning (CoNLL’17), pp. 100–111. Cited by: §1, §2.
  • T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean (2013) Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pp. 3111–3119. Cited by: §1, §2.
  • G. A. Miller and W. G. Charles (1991) Contextual correlates of semantic similarity. Language and Cognitive Processes 6 (1), pp. 1–28. Cited by: Table 1.
  • S. M. Mohammad and G. Hirst (2012) Distributional measures of semantic distance: a survey. arXiv preprint arXiv:1203.1858. Cited by: §2.
  • D. Movshovitz-Attias, S. E. Whang, N. Noy, and A. Halevy (2015) Discovering subsumption relationships for web-based ontologies. In Proc. of Intl. Workshop on Web and Databases (WebDB'15), pp. 62–69. Cited by: §1.
  • R. Navigli (2009) Word sense disambiguation: A survey. ACM Computing Surveys 41 (2), pp. 1–69. Cited by: §1.
  • J. Pennington, R. Socher, and C. Manning (2014) Glove: global vectors for word representation. In Proc. of Conf. on Empirical Methods in Natural Language Processing (EMNLP’14), pp. 1532–1543. Cited by: §1, §2, Table 2.
  • M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer (2018) Deep contextualized word representations. In Proc. of Conf. of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT’18), pp. 2227–2237. Cited by: §6.
  • Y. Pinter, R. Reichart, and I. Szpektor (2016) Syntactic parsing of web queries with question intent. In Proc. of Conf. of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT’16), pp. 670–680. Cited by: §1.
  • G. Pirró (2012) REWOrD: semantic relatedness in the web of data. In Proc. of AAAI Conf. on Artificial Intelligence (AAAI’12), pp. 129–135. Cited by: §2.
  • R. S. Roy, S. Agarwal, N. Ganguly, and M. Choudhury (2016) Syntactic complexity of web search queries through the lenses of language models, networks and users. Information Processing & Management 52 (5), pp. 923–948. Cited by: §1.
  • H. Rubenstein and J. B. Goodenough (1965) Contextual correlates of synonymy. Communications of the ACM 8 (10), pp. 627–633. Cited by: Table 1.
  • M. Sahlgren (2008) The distributional hypothesis. Italian Journal of Linguistics 20 (1), pp. 33–53. Cited by: §1.
  • R. Socher, J. Bauer, C. Manning, and A. Ng (2013a) Parsing with compositional vector grammars. In Proc. of Annual Meeting of the Association for Computational Linguistics (ACL’13), pp. 455–465. Cited by: §1.
  • R. Socher, A. Perelygin, J. Wu, J. Chuang, C. Manning, A. Ng, and C. Potts (2013b) Recursive deep models for semantic compositionality over a sentiment treebank. In Proc. of Conf. on Empirical Methods in Natural Language Processing (EMNLP’13), pp. 1631–1642. Cited by: §1.
  • L. Vial, B. Lecouteux, and D. Schwab (2018) Improving the coverage and the generalization ability of neural word sense disambiguation through hypernymy and hyponymy relationships. arXiv preprint arXiv:1811.00960. Cited by: §2.
  • D. Yuan, J. Richardson, R. Doherty, C. Evans, and E. Altendorf (2016) Semi-supervised word sense disambiguation with neural models. In Proc. of Intl. Conf. on Computational Linguistics (COLING’16), pp. 1374–1385. Cited by: §2.