Ukb: graph-based WSD and similarity
Hyperlinks and other relations in Wikipedia are an extraordinary resource which is still not fully understood. In this paper we study the different types of links in Wikipedia, and contrast the use of the full graph with respect to just direct links. We apply a well-known random walk algorithm on two tasks, word relatedness and named-entity disambiguation. We show that using the full graph is more effective than just direct links by a large margin, that non-reciprocal links harm performance, and that there is no benefit from categories and infoboxes, with coherent results on both tasks. We set new state-of-the-art figures for systems based on Wikipedia links, comparable to systems exploiting several information sources and/or supervised machine learning. Our approach is open source, with instructions to reproduce results, and amenable to integration with complementary text-based methods.
Hyperlinks and other relations between concepts and instances in Wikipedia have been successfully used in semantic tasks [Milne and Witten2013]. Still, many questions about the best way to leverage those links remain unanswered. For instance, methods using direct hyperlinks alone would wrongly disambiguate Lions in Figure 1 to B&I_Lions, a rugby team from Britain and Ireland, as it shares two direct links to potential referents in the context (Darrel Fletcher, a British football player, and Cape Town, the city where the team suffered some memorable defeats), while Highveld_Lions, a cricket team from South Africa, has only one. When considering the whole graph of hyperlinks we find that the cricket team is related to two cricketers named Alan Kourie and Duncan Fletcher and could thus pick the right entity for Lions in this context. In this paper we will study this and other questions about the use of hyperlinks in word relatedness [Gabrilovich and Markovitch2007] and named-entity disambiguation, NED [Hachey et al.2012].
Previous work in this area has typically focused on novel algorithms which work on a specific mix of resource, information source, task and test dataset (cf. Sect. 7). In the case of NED, the evaluation of the disambiguation component is confounded by interactions with mention spotting and candidate generation. With very few exceptions, there is little analysis of components and alternatives, and it is very difficult to gain any insight beyond the fact that the mix under study attained certain performance on the target dataset. (See [Hachey et al.2012] and [García et al.2014] for two exceptions on NED: the first is limited to a single dataset, and the second explores methods based on direct links, which we extend to using the full graph.) The number of algorithms and datasets is growing by the day, with no well-established single benchmark, and the fact that some systems are developed on test data, coupled with reproducibility problems [Fokkens et al.2013, on word relatedness], makes it very difficult to know where the area stands. There is a need for clear points of reference which make it possible to understand where each information source and algorithm stands with respect to other alternatives.
We thus depart from previous work, seeking to set such a point of reference, and focus on a single knowledge source (hyperlinks in Wikipedia) with a clear research objective: given a well-established random walk algorithm (Personalized PageRank [Haveliwala2002]) we explore sources of links and filtering methods, and contrast the use of the full graph with respect to using just direct links. We follow a clear development/test/analysis methodology, evaluating on an extensive range of both relatedness and NED datasets. The results are confirmed in both tasks, lending more support to the findings of this research. All software and data are publicly available, with instructions to obtain out-of-the-box replicability (http://ixa2.si.ehu.es/ukb/README.wiki.txt).
The contributions of our research are the following: (1) We show for the first time that performing random walks over the full graph is preferable to considering only direct links. (2) We study several sources of links, showing that non-reciprocal links hurt and that the contribution of the category structure and links in infoboxes is residual. (3) We set the new state of the art for systems based on Wikipedia links for both word relatedness and named-entity disambiguation. The results are close to the best systems to date, which use several information sources and/or supervised machine learning techniques, and specialize in either relatedness or disambiguation. Our work shows that a careful analysis of varieties of graphs using a well-known random walk algorithm pays off more than most ad-hoc algorithms.
The article is structured as follows. We first present previous work, followed by the different options to build hyperlink graphs. Sect. 4 reviews random walks for relatedness and NED. Sect. 5 sets the experimental methodology, followed by the analysis and results on development data (Sect. 6) and the comparison to the state of the art (Sect. 7). Finally, Sect. 8 draws the conclusions.
The emergence of Wikipedia has opened up enormous opportunities for natural language processing [Hovy et al.2013], with many derived knowledge bases, including DBpedia [Bizer et al.2009], Freebase [Bollacker et al.2008], and BabelNet [Navigli and Ponzetto2012a], to name a few. These resources have been successfully used on semantic processing tasks like word relatedness, named-entity disambiguation (NED), also known as entity linking, and the closely related Wikification. Broadly speaking, Wikipedia-based approaches to those tasks can be split between those using the text in the articles (e.g., Gabrilovich and Markovitch, 2007) and those using the links between articles (e.g., Guo et al., 2011).
Relatedness systems take two words and return a high number if the two words are similar or closely related (e.g. professor - student), and a low number otherwise (e.g. professor - cucumber). Relatedness is more general than similarity; for the sake of simplicity, we will talk about relatedness in this paper. Evaluation is performed by comparing the returned values to those given by humans [Rubenstein and Goodenough1965].
In NED [Hachey et al.2012] the input is a mention of a named-entity in context and the output is the appropriate instance from Wikipedia, DBpedia or Freebase (cf. Figure 1). Wikification is similar [Mihalcea and Csomai2007], but target terms include common nouns and only relevant terms are disambiguated. Note that the disambiguation component in Wikification and NED can be the same.
Our work focuses on relatedness and NED. We favored NED over Wikification because of the larger number of systems and evaluation datasets, but our conclusions are applicable to Wikification, as well as other Wikipedia-derived resources.
In this section we will focus on previous work using Wikipedia links for relatedness, NED and Wikification. Although relatedness and disambiguation are closely related (relatedness to context terms is an important disambiguation clue for NED), most of the systems are evaluated in either relatedness or NED, with few exceptions, like WikiMiner [Milne and Witten2013], KORE [Hoffart et al.2012] and the one presented in this paper.
Milne and Witten [Milne and Witten2008a] were the first to use hyperlinks between articles for relatedness. They compare two articles according to the number of incoming links that they have in common (i.e. overlap of direct links) based on Normalized Google Distance (NGD), combined with several heuristics and collocation strength. In later work [Milne and Witten2013], they incorporated machine learning. The authors also apply their technique to NED [Milne and Witten2008b], using their relatedness measures to train a supervised classifier. Unfortunately they do not present results of their link-based method alone, so we decided to reimplement it (cf. Sect. 6). We show that, under the same conditions, using the full graph is more effective in both tasks. We also run their out-of-the-box system (https://sourceforge.net/projects/wikipedia-miner/) on the same datasets as ours (cf. Sect. 7), with results below ours.
Apart from hyperlinks between articles, other works on relatedness use the category structure [Strube and Ponzetto2006, Ponzetto and Strube2007, Ponzetto and Strube2011] to run path-based relatedness algorithms which had been successful on WordNet [Pedersen et al.2004], or use relations in infoboxes [Nastase and Strube2013]. In all cases, they obtain performance figures well below hyperlink-based systems (cf. Sect. 7). We will explore the contribution of such relations (cf. Sect. 3), incorporating them to the hyperlink graph.
Attempts to use the whole graph of hyperlinks for relatedness have been reported before. Yeh et al. [Yeh et al.2009] obtained very low results on relatedness using an algorithm based on random walks similar to ours. Similar in spirit, Yazdani and Popescu-Belis [Yazdani and Popescu-Belis2013] built a graph derived from the Freebase Wikipedia Extraction dataset, which is derived from, but richer than, Wikipedia. Even if they mix hyperlinks with textual similarity, their results are lower than ours. One of the key differences with these systems is that we remove non-reciprocal links (cf. Sect. 3).
Regarding link-based methods for NED, there is only one system which relies exclusively on hyperlinks. Guo et al. [Guo et al.2011] use direct hyperlinks between the target entity and the mentions in the context, counting the number of such links. We show that the use of the full graph produces better results.
The rest of the NED systems present complex combinations. Lehmann et al. [Lehmann et al.2010] present a supervised system combining features based on hyperlinks, categories, text similarity and relations from infoboxes. Despite their complex and rich system, we will show that they perform worse than our system. Hachey et al. [Hachey et al.2011] explored hyperlinks beyond direct links for NED, building subgraphs for each context using paths of length two departing from the context terms, combined with text-based relatedness. We will show that the full graph is more effective than limiting the distance to two, and report better results than their system. Several authors have included direct links using the aforementioned NGD in their combined systems [Ratinov et al.2011, Hoffart et al.2011]. Unfortunately, they do not report separate results for the NGD component. In very recent work, García et al. [García et al.2014] compare NGD with several other algorithms using direct links, but do not explore the full graph or try to characterize links. We will see that their results are well below ours (cf. Sect. 7).
Graph-based algorithms for relatedness and disambiguation have been successfully used on other resources, particularly WordNet. Hughes and Ramage [Hughes and Ramage2007] were the first to present a random walk algorithm over the WordNet graph. Agirre et al. [Agirre et al.2010] improved over their results using a similar random walk algorithm on several variations of WordNet relations, reporting the best results to date among WordNet-based algorithms. The same algorithm was used for word sense disambiguation [Agirre et al.2014], also reporting state-of-the-art results. We use the same open source software in our experiments. As an alternative to random walks, Tsatsaronis et al. [Tsatsaronis et al.2010] use a path-based system over the WordNet relation graph.
In more recent work [Navigli and Ponzetto2012b, Pilehvar et al.2013], the authors present two relatedness algorithms for BabelNet, an enriched version of WordNet including articles from Wikipedia, hyperlinks and cross-lingual relations from non-English Wikipedias. In related work, Moro et al. [Moro et al.2014] present a multi-step NED algorithm on BabelNet, building semantic graphs for each context. We will show that Wikipedia hyperlinks alone are able to provide similar performance on both tasks.
Wikipedia pages can be classified into main articles, category pages, redirects and disambiguation pages. Given a Wikipedia dump (a snapshot from April 4, 2013), we mine links between articles, between articles and category pages, as well as the links between category pages (the category structure). Our graphs include a directed edge from one article to another iff the text of the first article contains a hyperlink to the second article. In addition, we also include hyperlinks in infoboxes.
The graph contains two types of nodes (articles and categories) and three types of directed edges: hyperlinks from article to article (H), infobox links from article to article (I), links from article to category and links from category to category (C).
We constructed several graphs using different combinations of nodes and edges. In addition to the directed versions (d) we also constructed an undirected version (u), and a reduced graph which only contains links which are reciprocal (r), that is, we add a pair of edges between articles a1 and a2 if and only if there exists a hyperlink from a1 to a2 and from a2 to a1. Reciprocal links capture the intuition that both articles are relevant to each other, and tackle issues with links to low-relevance articles, e.g. links to articles on specific years like 1984. Some authors weight links according to their relevance [Milne and Witten2013]. Our heuristic to keep only reciprocal links can be seen as a simpler, yet effective, method to avoid low-relevance links.
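The reciprocal filter can be illustrated with a minimal sketch. The function name `reciprocal_graph` and the toy edge list are hypothetical and chosen only for illustration; the sketch assumes hyperlinks are given as (source, target) pairs and is not the code used to build Hr.

```python
from collections import defaultdict


def reciprocal_graph(directed_edges):
    """Keep only article pairs that link to each other in both directions.

    Returns an adjacency dict; each kept pair is stored as two directed
    edges, one in each direction.
    """
    edge_set = set(directed_edges)
    adjacency = defaultdict(set)
    for source, target in edge_set:
        # Self-links are skipped; a pair survives only if the reverse
        # hyperlink also exists.
        if source != target and (target, source) in edge_set:
            adjacency[source].add(target)
            adjacency[target].add(source)
    return dict(adjacency)


# Toy hyperlinks (hypothetical): the link to the year article has no
# reverse link, so it is dropped.
links = [
    ("Highveld_Lions", "Alan_Kourie"),
    ("Alan_Kourie", "Highveld_Lions"),
    ("Highveld_Lions", "1984"),
]
hr = reciprocal_graph(links)
```

In this toy case the year article disappears from the graph altogether, which matches the observation that the reciprocal graph has fewer nodes as well as fewer edges.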
Table 1 gives the number of nodes and edges in some selected graphs. The graph with the fewest edges is the one with reciprocal hyperlinks, Hr, and the graphs with the most edges are those with undirected edges, as each edge is modeled as two directed edges (this was done in order to combine undirected and reciprocal edges, and could be avoided in other cases). The number of nodes is similar in all, except for the infobox graphs (infoboxes are only available for a few articles), and the reciprocal graph Hr, as relatively few nodes have reciprocal edges.
Table 2: Partial view of dictionary entry for “gotham”. The probability is calculated as the ratio between the frequency and the total count.
In order to link running text to the articles in the graph, we use a dictionary, i.e., a static association between string mentions with all possible articles the mention can refer to.
We built our dictionary from the same Wikipedia dump, using article titles, redirections, disambiguation pages, and anchor text. Mention strings are lowercased and all text between parentheses is removed. If an anchor links to a disambiguation page, the text is associated with all possible articles the disambiguation page points to. Each association between a mention and an article is scored with the prior probability, estimated as the number of times that the mention occurs as an anchor linking to that article divided by the total number of occurrences of the mention as an anchor. Note that our dictionary can disambiguate any mention, just returning the highest-scoring article. Table 2 partially shows a sample entry in our dictionary.
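As an illustration of how such priors can be estimated, the following minimal sketch builds a dictionary from (anchor text, article) pairs only. The names `normalize` and `build_dictionary` and the toy counts are hypothetical; titles, redirections and disambiguation pages are left out for brevity.

```python
import re
from collections import Counter, defaultdict


def normalize(mention):
    """Lowercase the mention and drop any text between parentheses."""
    without_parens = re.sub(r"\s*\([^)]*\)", "", mention)
    return " ".join(without_parens.lower().split())


def build_dictionary(anchor_pairs):
    """Map each normalized mention to {article: prior probability}."""
    counts = defaultdict(Counter)
    for anchor_text, article in anchor_pairs:
        counts[normalize(anchor_text)][article] += 1
    dictionary = {}
    for mention, article_counts in counts.items():
        total = sum(article_counts.values())
        # Prior = frequency of (mention, article) / total count of mention.
        dictionary[mention] = {
            article: freq / total for article, freq in article_counts.items()
        }
    return dictionary


# Toy anchor occurrences (hypothetical counts, not taken from the dump).
anchors = [("Gotham", "Gotham_City")] * 3 + [("Gotham (magazine)", "Gotham_(magazine)")]
toy_dictionary = build_dictionary(anchors)
best_article = max(toy_dictionary["gotham"], key=toy_dictionary["gotham"].get)
```

Returning the highest-scoring article, as `best_article` does here, corresponds to disambiguating a mention with the prior alone.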
Table 3: Sample of the probability distribution returned by ppr for two words. Top five articles shown.
The PageRank random walk algorithm [Brin and Page1998] is a method for ranking the vertices in a graph according to their relative structural importance. PageRank can be viewed as the result of a random walk process, where the final rank of node i represents the probability of a random walk over the graph ending on node i, at a sufficiently large time.
Personalized PageRank (ppr) is a variation of PageRank [Haveliwala2002], where the query of the user defines the importance of each node, biasing the resulting PageRank score to prefer nodes in the vicinity of the query nodes. The query bias is also called the teleport vector. ppr has been successfully used on the WordNet graph for relatedness [Hughes and Ramage2007, Agirre et al.2010] and WSD [Agirre and Soroa2009, Agirre et al.2014]. In our experiments we use UKB version 2.1 (http://ixa2.si.ehu.es/ukb), an open source software for relatedness and disambiguation based on ppr. For the sake of space, we will skip the details, and refer the reader to those papers. ppr has two parameters: the number of iterations, and the damping factor, which controls the relative weight of the teleport vector.
Given a dictionary and graph derived from Wikipedia (cf. Sect. 3), ppr expects a set of mentions, i.e., a set of strings which can be linked to Wikipedia articles via the dictionary. The method first initializes the teleport vector: for each mention in the input, the articles in the respective dictionary entry are set with an initial probability, and the rest of articles are set to zero. We explored two options to set the initial probability of each article: the uniform probability or the prior probability in the dictionary. When an article appears in the dictionary entry for two mentions, the initial probability is summed up. In a second step, we apply ppr for a number of iterations, producing a probability distribution over Wikipedia articles in the form of a ppr vector (ppv).
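The two steps can be sketched with a plain power iteration. This is a minimal illustration and not the UKB implementation: the function names, the toy graph and dictionary, the values `damping=0.85` and `iterations=30`, and the handling of nodes without neighbors (their mass is returned to the teleport vector) are all assumptions made for the sketch.

```python
from collections import defaultdict


def teleport_vector(dictionary, mentions, use_prior=True):
    """Initial probability per article, summed over all input mentions."""
    vector = defaultdict(float)
    for mention in mentions:
        entry = dictionary.get(mention, {})
        for article, prior in entry.items():
            # Either the dictionary prior or a uniform share of the entry.
            vector[article] += prior if use_prior else 1.0 / len(entry)
    return dict(vector)


def personalized_pagerank(graph, teleport, damping=0.85, iterations=30):
    """Power iteration; graph maps each node to the set of nodes it links to."""
    nodes = set(graph) | set(teleport)
    for neighbors in graph.values():
        nodes |= set(neighbors)
    total = sum(teleport.values())
    v = {n: teleport.get(n, 0.0) / total for n in nodes}  # normalized teleport
    rank = dict(v)
    for _ in range(iterations):
        incoming = {n: 0.0 for n in nodes}
        dangling = 0.0
        for n in nodes:
            neighbors = graph.get(n, ())
            if neighbors:
                share = rank[n] / len(neighbors)
                for m in neighbors:
                    incoming[m] += share
            else:
                dangling += rank[n]  # assumption: send this mass to teleport
        rank = {
            n: damping * (incoming[n] + dangling * v[n]) + (1.0 - damping) * v[n]
            for n in nodes
        }
    return rank


# Toy data (hypothetical): one mention with two candidate articles.
toy_dictionary = {"drink": {"Drink": 0.8, "Alcoholic_drink": 0.2}}
toy_graph = {
    "Drink": {"Alcoholic_drink"},
    "Alcoholic_drink": {"Drink", "Ethanol"},
    "Ethanol": {"Alcoholic_drink"},
}
teleport = teleport_vector(toy_dictionary, ["drink"])
ppv = personalized_pagerank(toy_graph, teleport)
```

The resulting `ppv` is a probability distribution over all articles in the toy graph, including articles such as Ethanol that received no initial probability.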
The probability vector can be used for both relatedness and NED. For relatedness we produce a ppv vector for each of the words to be compared, using the single word as input mention. The relatedness between the target words is computed as the cosine between the respective ppv vectors. In order to speed up the computation, we can reduce the size of the ppv vectors, setting to zero all values below rank k after ordering the values in decreasing order.
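A minimal sketch of the relatedness computation follows, assuming each ppv vector is stored as a sparse dict from article to probability. The names `truncate` and `cosine` and the toy values are hypothetical.

```python
import math


def truncate(ppv, k):
    """Keep the k highest-ranked entries; all other values count as zero."""
    top = sorted(ppv.items(), key=lambda item: item[1], reverse=True)[:k]
    return dict(top)


def cosine(u, v):
    """Cosine between two sparse vectors stored as dicts."""
    dot = sum(weight * v[article] for article, weight in u.items() if article in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy ppv vectors (hypothetical values).
ppv_drink = {"Drink": 0.5, "Alcoholic_drink": 0.3, "Ethanol": 0.2}
ppv_alcohol = {"Ethanol": 0.6, "Alcoholic_drink": 0.4}
relatedness = cosine(truncate(ppv_drink, 2), truncate(ppv_alcohol, 2))
```

With k = 2 the two toy vectors still share one article, so the relatedness is non-zero; vectors with no article in common yield zero.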
Table 3 shows the top 5 articles in the ppv vectors of two sample words. The relatedness between Drink and Alcohol would be non-zero, as their respective vectors contain common articles.
For NED the input comprises the target entity mention and its context, defined as the set of mentions occurring within a 101-token window centered on the target. In order to extract mentions of Wikipedia articles from the context, we match the longest strings in our dictionary as we scan tokens from left to right. We then initialize the teleport probability with all articles referred to by the mentions. After computing Personalized PageRank, we output the article with the highest rank in ppv among the possible articles for the target entity mention. Figure 1 shows an example of NED.
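The context extraction can be sketched as a greedy longest-match scan. The function names and the `max_len` cap on mention length are hypothetical; the text does not state a maximum mention length, so the cap is an assumption that merely bounds the search.

```python
def context_window(tokens, target_index, size=101):
    """Tokens in a window of `size` tokens centered on the target token."""
    half = size // 2
    return tokens[max(0, target_index - half): target_index + half + 1]


def extract_mentions(tokens, dictionary, max_len=5):
    """Scan left to right, always taking the longest dictionary match."""
    mentions = []
    i = 0
    while i < len(tokens):
        for length in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + length]).lower()
            if candidate in dictionary:
                mentions.append(candidate)
                i += length  # continue after the matched span
                break
        else:
            i += 1  # no dictionary entry starts at this token
    return mentions


# Toy dictionary keys (hypothetical); only membership matters here.
toy_entries = {"lions", "cape", "cape town"}
found = extract_mentions("Lions beat Cape Town".split(), toy_entries)
```

The scan prefers "cape town" over the shorter entry "cape", which is the longest-match behaviour described above.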
If the prior is being used to initialize weights, we multiply the prior probability with the PageRank probabilities before computing the final ranks. In the rare cases (less than 3% of instances) where no known mention is found in the context, we return the node with the highest prior.
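The final selection step can be sketched as follows. The function name `disambiguate`, its parameters and the priors in the toy example are hypothetical; the two ppr values are the ones quoted later in the text for the Lions example.

```python
def disambiguate(candidates, ppv, context_found=True, use_prior=True):
    """Pick the best article for the target mention.

    candidates: {article: prior} for the target mention.
    ppv: {article: Personalized PageRank probability}.
    """
    if not context_found:
        # Back-off: no known mention in the context, so use the prior alone.
        return max(candidates, key=candidates.get)

    def score(article):
        rank = ppv.get(article, 0.0)
        return rank * candidates[article] if use_prior else rank

    return max(candidates, key=score)


# Priors are hypothetical; ppr values follow the Lions example.
lions_candidates = {"B&I_Lions": 0.6, "Highveld_Lions": 0.4}
lions_ppv = {"B&I_Lions": 0.05, "Highveld_Lions": 0.75}
chosen = disambiguate(lions_candidates, lions_ppv)
fallback = disambiguate(lions_candidates, lions_ppv, context_found=False)
```

In this toy case the graph evidence overrides the higher prior of the rugby team, while the back-off returns the article with the highest prior.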
Note that our NED and relatedness algorithms are related. NED is using relatedness, as PageRank probabilities capture how related each candidate article is to the context of the mention. Following the first-order and second-order co-occurrence abstraction [Islam and Inkpen2006, Agirre and Edmonds2007, Ch. 6], we can interpret that we do NED using first-order relatedness, while our relatedness uses second-order relatedness.
We summarize the datasets used in Table 4. RG, MC and 353 are the most used relatedness datasets to date, with TSA and KORE being more recent datasets where some top-ranking systems have been evaluated. Word relatedness datasets were lemmatized and lowercased, except for KORE, which is an entity relatedness dataset where the input comprises article titles. We had to manually adjust the articles in KORE, as the exact title depends on the Wikipedia version; we missed 3 for our 2013 version, which could slightly degrade our results. Following common practice, rank correlation (Spearman) was used for evaluation.
|Task||Dataset||Reference||Size|
|Relatedness||RG||[Rubenstein and Goodenough1965]||65|
|Relatedness||MC||[Miller and Charles1991]||30|
|Relatedness||353||[Gabrilovich and Markovitch2007]||353|
|Relatedness||TSA||[Radinsky et al.2011]||287|
|Relatedness||KORE||[Hoffart et al.2012]||420|
|NED||TAC09||[McNamee et al.2010]||1675|
|NED||AIDA||[Hoffart et al.2011]||4401|
|NED||KORE||[Hoffart et al.2012]||143|
Regarding NED, the TAC Entity Linking competition is held annually. Due to its popularity it is useful to set the state of the art. We selected the datasets in 2009 and 2010, as they have been used to evaluate several top ranking systems, as well as the 2013 dataset, which is the most recent. In addition, we also provide results for AIDA, the largest and only dataset providing annotations for all entities in the documents, and KORE, a recent, very small dataset focusing on difficult mentions and short contexts. Evaluation was performed using accuracy, the ratio between correctly disambiguated instances and the total number of instances that have a link to an entity in the knowledge base (this corresponds to non-NIL accuracy at TAC-KBP, also called KB accuracy, and to Micro P@1.0 in [Hoffart et al.2011]). Each dataset uses a different Wikipedia version, but fortunately Wikipedia keeps redirects from older article titles to the new version. As customary in the task, we automatically map the articles returned by our system to the version used in the gold standard.
Following standard practice in NED, we do not evaluate mention detection (see [Cornolti et al.2013] for a framework to evaluate both mention detection and disambiguation), that is, the datasets already specify which are the target mentions. Note that TAC provides so-called “queries” which can be substrings of the full mention (e.g. “Smith” for a mention like “John Smith”). Given a mention, we devised the following heuristics to improve candidate generation: (1) remove any substring contained in parentheses from the mention, then check the dictionary, (2) if not found, remove “the” if it is the first token in the mention, then check the dictionary, (3) if not found, remove the middle token if the mention contains three tokens, then check the dictionary, (4) if not found, search for a matching entity using the Wikipedia API (http://en.wikipedia.org/w/api.php). The heuristics provide an improvement of around 4 points on development. Later analysis showed that these heuristics seem to be only relevant on the TAC datasets, because of the way the query strings are designed, but not on AIDA or KORE.
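The back-off chain can be sketched as below. The function name, the `api_search` callable standing in for the Wikipedia API lookup, and the choice to apply each heuristic on top of the previous one are assumptions made for illustration; no network call is performed in the sketch.

```python
import re


def generate_candidates(mention, dictionary, api_search=None):
    """Return the dictionary entry for a mention, backing off step by step."""
    # (1) Remove text in parentheses, normalize, then check the dictionary.
    text = re.sub(r"\s*\([^)]*\)", "", mention)
    tokens = text.lower().split()
    if " ".join(tokens) in dictionary:
        return dictionary[" ".join(tokens)]
    # (2) Remove a leading "the".
    if tokens and tokens[0] == "the":
        tokens = tokens[1:]
        if " ".join(tokens) in dictionary:
            return dictionary[" ".join(tokens)]
    # (3) Remove the middle token of a three-token mention.
    if len(tokens) == 3:
        tokens = [tokens[0], tokens[2]]
        if " ".join(tokens) in dictionary:
            return dictionary[" ".join(tokens)]
    # (4) Last resort: external search (hypothetical callable).
    if api_search is not None:
        return api_search(mention)
    return {}


# Toy dictionary (hypothetical entry and prior).
toy_dictionary = {"john smith": {"John_Smith": 1.0}}
direct = generate_candidates("John Smith (politician)", toy_dictionary)
backed_off = generate_candidates("The John Q Smith", toy_dictionary)
```

Passing a function as `api_search` keeps the final step pluggable, so the first three heuristics can be exercised offline.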
We wanted to follow a standard experimental design, with a clear development/test split for each task. Unfortunately there is no standard split in the literature, and the choice is difficult: the development dataset should be representative enough to draw conclusions on different alternatives and parameters, but at the same time the most relevant datasets in the literature should be left for testing, in order to have enough points for comparison. In addition, some recent algorithms supposedly setting the state of the art are only tested on newly produced datasets. Note also that relatedness datasets are small, making it difficult to find statistically significant differences.
In order to strike a balance between the need for in-depth analysis and fair comparison to previous results, we decided to focus on the two oldest datasets from each task for development and analysis: RG for relatedness and a subset of 200 polysemic instances from TAC09 for NED (TAC09). The dataset at http://ixa2.si.ehu.es/ukb/README.wiki.txt includes the subset. The rest will be used for test, where the parameters have been set on development. Given the need for significant conclusions, we re-checked the main conclusions drawn from development data using the aggregation of all test datasets, but only after the comparison to the state of the art had been performed. This way we ensure both a fair comparison with the state of the art and a well-grounded analysis.
We performed significance tests using Fisher’s z-transformation for relatedness[Press et al.2002, equation 14.5.10], and paired bootstrap resampling for NED [Noreen1989], accepting differences with p-value . Given the small size of the datasets, when necessary, we also report statistical significance when joining all datasets as just mentioned.
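For the relatedness side, a minimal sketch of a test of this kind is given below. It assumes the standard two-sample form of Fisher's z-transformation, which treats the two correlations as coming from independent samples; the function name is hypothetical, and the example reuses the two RG figures for ppr with full convergence and with a single iteration (88.4 and 43.4, on 65 pairs).

```python
import math


def fisher_z_pvalue(r1, n1, r2, n2):
    """Two-sided p-value for the difference between two correlations."""
    z1, z2 = math.atanh(r1), math.atanh(r2)  # Fisher's z-transformation
    standard_error = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    return math.erfc(abs(z1 - z2) / (math.sqrt(2.0) * standard_error))


# Correlations expressed in the 0-1 range, 65 word pairs in each case.
p_value = fisher_z_pvalue(0.884, 65, 0.434, 65)
```

Small datasets inflate the standard error, which is why differences of a few points on datasets of this size rarely reach significance.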
In this section we study the performance of the different graphs and parameters on the two development datasets, RG and TAC09. The next section reports the results on the test sets for the best parameters, alongside state-of-the-art system results.
As mentioned in Sect. 4.1, ppr has several parameters and variants (cf. Figure 2). We first checked exhaustively all possible combinations for different graphs, with the rest of parameters set to default values. We then optimized each of the parameters in turn, seeking to answer the following questions:
Which links help most? Table 1 shows the results for selected graphs. The first seven rows present the results for each edge source in isolation, both using directed and undirected edges. Categories and infoboxes suffer from producing smaller graphs, with the hyperlinks yielding the best results. The undirected versions improve over directed links in all cases, with the use of reciprocal edges for hyperlinks obtaining the best results overall (the graphs with reciprocal edges for categories and infoboxes were too small and we omit them). The trend is the same in both relatedness and NED, highlighting the robustness of these results.
Regarding combined graphs, we report the most significant combinations. The reciprocal graph of hyperlinks outperforms all combinations (including the combinations which were omitted), showing that categories and infoboxes do not help or even degrade slightly the results. The differences are statistically significant (either on the individual datasets or in the aggregation on all datasets) in all cases, confirming that Hr is significantly better.
The degradation or lack of improvement when using infoboxes is surprising. We hypothesized that it could be caused by non-reciprocal links in HrIu. In fact, removing non-reciprocal links from HrIu improved results slightly on NED, matching those of Hr. This lack of improvement with infoboxes, even when removing non-reciprocal links, can be explained by the fact that only 5% of reciprocal links in Iu are not in Hr. It seems that this additional 5% is not helping in this particular dataset. Regarding categories, the category structure is mostly a tree, which is a structure where random walks do not seem to be effective, as already observed in [Agirre et al.2014] for WordNet.
Is initialization of random walks important? The second row in Table 5 reports the result when using uniform distributions when initializing the random walks (instead of prior probabilities). The results degrade in both datasets, the difference being significant only for NED. This was later confirmed in the rest of relatedness and NED datasets: using prior probabilities for initialization improves results in all cases, but it is only significant in NED datasets. These results show that relatedness is less sensitive to changes in the distribution of meanings, that is, using the more informative prior distributions of meaning only improves results slightly. NED, on the contrary, is more sensitive, as the distribution of senses dramatically affects the performance.
Is the value of the damping factor and the number of iterations important? The best results on both datasets were obtained with default values (cf. Table 5), in agreement with related work using WordNet [Agirre et al.2010]. The lowest numbers of iterations at which convergence was obtained were 30 and 15, respectively, although as few as 5 iterations yielded very similar performance (87.1 on relatedness, 68.0 on NED).
Is the size of the vector, k, important for relatedness? The best performance was attained with the default value of k, with only minor variations otherwise.
|Graph||Method||RG||TAC09|
|Hr||ppr (1 iter.)||43.4||60.5|
|Hr||ppr (2 iter.)||78.3||66.0|
Is the full graph helping? When the ppr algorithm does a single iteration, we can interpret that it is ranking all entities using direct links. When doing two iterations, we can loosely say that it is using links at distance two, and so on. Table 6 shows that ppr is able to benefit from the full graph well beyond 2 iterations, especially in relatedness. These results were confirmed in the full set of datasets, with statistically significant differences in all cases.
In addition, we reimplemented the relatedness and NED algorithms based on NGD over direct links [Milne and Witten2008a, Milne and Witten2008b], which allows us to compare them to ppr under the same experimental conditions. We first developed the relatedness algorithm. In order to replicate the NGD relatedness algorithm, we checked the open source code available, exploring the use of inlinks and outlinks and the use of maximum pairwise article relatedness. We also realized that the use of priors (“commonness” according to the terminology in the paper) was hurting, so we dropped it. We checked both reciprocal and unidirectional versions of the hyperlink graph, with better results for the reciprocal graph. Table 6 reports the best variant, which outperforms the 0.64 on RG reported in their paper. We followed a similar methodology for NED: we checked both reciprocal and undirected graphs with similar results, combined with prior (similar results), weighted terms in the context (with improvement) and checked the use of ambiguous mentions in the context (marginal improvement). Reported results correspond to reciprocal, combination with prior, weighting terms and using only monosemous mentions. Table 6 shows the results for NGD, which performs worse than ppr. This trend was confirmed on the full set of datasets for relatedness and NED with statistical significance in all cases except KORE, which is the smallest NED dataset. Figure 1 illustrates why the use of longer paths is beneficial. In fact, NGD returns 0.14 for B&I_Lions and 0.13 for Highveld_Lions, but ppr correctly returns 0.05 and 0.75, respectively.
|[Ponzetto and Strube2011]||Wiki11||c||75.0*|
|[Nastase and Strube2013]||Wiki13||ci||67.0|
|[Milne and Witten2013]||Wiki13||la||69.5r||59.7r||35.8r||77.2r||65.9r|
|[Yeh et al.2009]||Wiki09||g||48.5|
|ppr default Hr||Wiki13||g||0||88.4*||1||72.8||1||64.1||1||81.0||1||66.2|
|[Agirre et al.2010]||WNet||g||1||86.2r||68.5||45.4r||3||85.2r|
|[Tsatsaronis et al.2010]||WNet||g||86.1||61.0|
|[Navigli and Ponzetto2012b]||WNet+Wiki12 (cl)||g+CL||65.0||1||90.0|
|[Pilehvar et al.2013]||WNet+Wiki13||g||86.8*|
|ppr default Hr||Wiki13||g||0||88.4*||2||72.8||1||64.1||4||81.0||1||66.2|
|ppr default Hr||WNet+Wiki13||g||0||91.8*||1||78.5||2||62.9||2||87.6||1||66.2|
|[Gabrilovich and Markovitch2007]||Wiki07||t||82.0||75.0||59.0||73.0|
|[Hoffart et al.2012]||Wiki12||t||0||69.8*|
|[Yazdani and Popescu-Belis2013]||Freebase||gt||70.0*|
|[Radinsky et al.2011]||Time||C||1||80.0||1||63.0|
|[Baroni et al.2014]||Corpus||C||84.0*||71.0|
|[Agirre et al.2009]||WNet+Corpus||Cg+SUP||0||96.0x||78.0x|
|[Milne and Witten2013]||Wiki13||la+SUP||83.5r||74.0x||52.8r||81.3r||1||66.5r|
|ppr default Hr||WNet+Wiki13||g||0||91.8*||2||78.5||2||62.9||2||87.6||2||66.2|
|[Guo et al.2011]||Wiki10||l||1||74.0||74.1|
|[Milne and Witten2013]||Wiki13||la||57.4r||58.5r||37.1r||56.0r||35.7r|
|[García et al.2014]||Wiki12||l||76.6|
|ppr default Hr||Wiki13||g||0||78.8*||1||83.6||1||81.7||1||80.0||1||60.8|
|[Moro et al.2014]||WNet+Wiki13||g+CL||1||82.1||1||71.5|
|ppr default Hr||Wiki13||g||0||78.8*||1||83.6||1||81.7||2||80.0||2||60.8|
|[Bunescu and Pasca2006]||Wiki11||tc||0||83.8ra*||68.4ra|
|[Hachey et al.2011]||Wiki11||tcg||79.8*|
|[Hoffart et al.2012]||Wiki12||t||0||81.8*||0||64.6*|
|[Hoffart et al.2011]||Wiki11||tli+SUP||0||81.8*|
|[Milne and Witten2013]||Wiki13||la+SUP||57.5r||63.4r||40.0r||55.6r||37.1r|
|Best TAC KBP system||—||—||1||76.5||80.6||77.7|
|ppr default Hr||Wiki13||g||0||78.8*||1||83.6||1||81.7||2||80.0||2||60.8|
How important is the Wikipedia version? Table 7 shows that the versions we tested do not affect the results dramatically, and that using the latest version does not yield better results in NED. Perhaps the larger size and number of hyperlinks of newer versions only affect new and rare articles, but not the ones present in TAC09. We kept using the 2013 version for test.
What is the efficiency of the algorithm? Initialization takes around 5 minutes (footnote 15: time measured on a single server with Xeon E7-4830 8-core processors, 2130 MHz, 64 GB RAM), most of which (4m50s) is spent loading the dictionary into memory. Using a database instead, initialization takes 10s. Memory requirements for Hr were 4.7 GB, down to 1.1 GB when using the database. The main bottleneck of our system is the computation of Personalized PageRank, with each iteration taking around 0.60 seconds. We are currently evaluating fast approximations for PageRank, and plan to improve efficiency.
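To show where the per-iteration cost comes from, the following is a minimal sketch of Personalized PageRank computed by power iteration. It is a toy illustration and not the implementation used in our system: the adjacency-list representation, the damping factor of 0.85, the 30 iterations and the handling of dangling nodes are all assumptions made for the example.

```python
def personalized_pagerank(graph, seeds, damping=0.85, iterations=30):
    """Power-iteration Personalized PageRank on a small graph.

    graph: dict mapping each node to a list of its out-neighbours; every
    neighbour must also be a key of the dict. seeds: dict mapping seed
    nodes to positive weights (the personalization). All parameter names
    and default values are illustrative assumptions.
    """
    nodes = list(graph)
    total = sum(seeds.get(n, 0.0) for n in nodes)
    if total <= 0:
        raise ValueError("at least one seed in the graph needs positive weight")
    teleport = {n: seeds.get(n, 0.0) / total for n in nodes}
    rank = dict(teleport)
    for _ in range(iterations):
        # Restart mass always flows back to the seed nodes.
        nxt = {n: (1.0 - damping) * teleport[n] for n in nodes}
        for n in nodes:
            out = graph[n]
            if out:
                share = damping * rank[n] / len(out)
                for m in out:
                    nxt[m] += share
            else:
                # Dangling node: hand its mass back to the seeds.
                for m in nodes:
                    nxt[m] += damping * rank[n] * teleport[m]
        rank = nxt
    return rank


# Toy usage: a three-node chain a - b - c, personalized on "a".
toy_graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
ranks = personalized_pagerank(toy_graph, {"a": 1.0})
```

Each iteration touches every edge once, so on a graph the size of Wikipedia the cost per iteration is dominated by the number of hyperlinks. This is why approximations that avoid full passes over the graph are attractive.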
In the previous section we presented several results under the same experimental conditions. We now use the graph and parametrization which yield the best results on development (default parameters with Hr). Comparison to the state of the art is complicated by the fact that many systems report results on different datasets, which makes the tables in this section rather sparse. The comparison for relatedness is straightforward but, in NED, it is not possible to factor out the impact of the candidate generation step. Given that our candidate generation procedure is not particularly sophisticated, we do not think this is a decisive factor in favour of our results.
Tables 8 and 9 report the results of the best systems on both tasks. Given that several systems were developed on test data, we also report our results on RG and TAC2009, marking all such results (see the table captions for details). We split the results in both tables into three sets: top rows for systems using link and graph information alone, middle rows for link- and graph-based systems using WordNet and/or Wikipedia, and bottom rows for more complex systems. We repeat the results of our system in each set of rows, for easier comparison. Our main focus is on the top rows, which show the superiority of our results with respect to other systems using Wikipedia links and graphs. The middle and bottom rows show the relation to the state of the art.
For ease of exposition, we examine the results one row section at a time, covering relatedness and NED together. The top rows in Table 8 report four relatedness systems which have already been presented in Sect. 2, showing that our system is best in all five datasets. Note that the [Milne and Witten2013] row was obtained by running their publicly available system with the supervised machine learning component turned off (see below for the results using SUP). The top rows of Table 9 report the most frequent baseline (as produced by our dictionary) and three link-based systems (cf. Sect. 2), showing that our method is best in all five datasets. These results show that the use of the full graph as devised in this paper is a winning strategy.
The relatedness results in the middle rows of Table 8 include several systems using WordNet and/or Wikipedia (cf. Sect. 2), including the system in [Agirre et al.2010], which we ran out-of-the-box with default values. To date, link-based systems using WordNet had reported stronger results than their counterparts on Wikipedia, but the table shows that our Wikipedia-based results are the strongest on all relatedness datasets but one (MC, the smallest dataset, with only 30 pairs). In addition, the table shows our results when combining random walks on Wikipedia and WordNet (footnote 16: we multiply the scores of ppr on Wikipedia and WordNet), which yields improvements in most datasets. In the counterpart for NED in Table 9, [Moro et al.2014] outperform our system, especially in the smaller KORE (143 instances), but note that they use a richer graph which combines WordNet, the English Wikipedia and hyperlinks from other language Wikipedias.
Finally, the bottom rows in both tables report the best systems to date. For lack of space, we cannot review systems not using Wikipedia links. Regarding relatedness, our combination of WordNet and Wikipedia would rank second in all datasets, with only a single system (based on corpora) beating our system in more than one dataset [Radinsky et al.2011]. Regarding NED, our system ranks first in the TAC datasets, also when compared with the best systems that participated in the TAC competitions [Varma et al.2009, Lehmann et al.2010, Cucerzan and Sil2013], and second to [Moro et al.2014] on AIDA and KORE.
This work departs from previous work based on Wikipedia and derived resources, as it focuses on a single knowledge source (links in Wikipedia) with a clear research objective: given a well-established random walk algorithm, we explored which sources of links and filtering methods are useful, contrasting the use of the full graph with the use of direct links alone. We follow a clear development/test/analysis methodology, evaluating on an extensive range of both relatedness and NED datasets. All software and data are publicly available, with instructions to obtain out-of-the-box replicability (footnote 17: http://ixa2.si.ehu.es/ukb/README.wiki.txt).
We show for the first time that random walks over the full graph of links improve over direct links. We studied several variations of sources of links, showing that non-reciprocal links hurt and that the contribution of the category structure and relations in infoboxes is residual. This paper sets a new state of the art for systems based on Wikipedia links on both word relatedness and named-entity disambiguation datasets. The results are close to those of the best combined systems, which specialize in either relatedness or disambiguation and use several information sources and/or supervised machine learning techniques. This work shows that a careful analysis of varieties of graphs using a well-known random walk algorithm pays off more than most ad-hoc algorithms proposed to date.
For the future, we would like to explore ways to single out informative hyperlinks, perhaps weighting edges according to their relevance, and would also like to speed up the random-walk computations.
This article showed the potential of the graph of hyperlinks. We would like to explore combinations with other sources of information and algorithms, perhaps using supervised machine learning. For relatedness, we already showed improvement when combining with random walks over WordNet, but would like to explore tighter integration [Pilehvar et al.2013]. For NED, local methods [Ratinov et al.2011, Han and Sun2011], global optimization strategies based on keyphrases in context like KORE [Hoffart et al.2012] and joint NED and word sense disambiguation [Moro et al.2014] are all complementary to our method and thus promising directions.
This work was partially funded by MINECO (CHIST-ERA READERS project – PCIN-2013-002-C02-01) and the European Commission (QTLEAP – FP7-ICT-2013.4.1-610516). Ander Barrena is supported by a PhD grant from the University of the Basque Country.
Collaboratively built semi-structured content and artificial intelligence: The story so far. Artif. Intell., 194:2–27, January.