Studying the Wikipedia Hyperlink Graph for Relatedness and Disambiguation

03/05/2015 ∙ by Eneko Agirre, et al. ∙ UPV/EHU

Hyperlinks and other relations in Wikipedia are an extraordinary resource which is still not fully understood. In this paper we study the different types of links in Wikipedia, and contrast the use of the full graph with the use of just direct links. We apply a well-known random walk algorithm to two tasks, word relatedness and named-entity disambiguation. We show that using the full graph is more effective than using just direct links by a large margin, that non-reciprocal links harm performance, and that there is no benefit from categories and infoboxes, with coherent results on both tasks. We set new state-of-the-art figures for systems based on Wikipedia links, comparable to systems exploiting several information sources and/or supervised machine learning. Our approach is open source, with instructions to reproduce the results, and can be integrated with complementary text-based methods.




1 Introduction

Hyperlinks and other relations between concepts and instances in Wikipedia have been successfully used in semantic tasks [Milne and Witten2013]. Still, many questions about the best way to leverage those links remain unanswered. For instance, methods using direct hyperlinks alone would wrongly disambiguate Lions in Figure 1 to B&I_Lions, a rugby team from Britain and Ireland, as it shares two direct links to potential referents in the context (Darrel Fletcher, a British football player, and Cape Town, the city where the team suffered some memorable defeats), while Highveld_Lions, a cricket team from South Africa, has only one. When considering the whole graph of hyperlinks we find that the cricket team is related to two cricketers named Alan Kourie and Duncan Fletcher and could thus pick the right entity for Lions in this context. In this paper we will study this and other questions about the use of hyperlinks in word relatedness [Gabrilovich and Markovitch2007] and named-entity disambiguation, NED [Hachey et al.2012].

Figure 1: Simplified example motivating the use of the full graph. It shows the disambiguation of Lions in “Alan Kourie, CEO of the Lions franchise, had discussions with Fletcher in Cape Town”. Each mention is linked to the candidate entities by arrows, e.g. B&I_Lions and Highveld_Lions for Lions. Solid lines correspond to direct hyperlinks and dashed lines to a path of several links. An algorithm using direct links alone would incorrectly output B&I_Lions, while one using the full graph would correctly choose Highveld_Lions.

Previous work in this area has typically focused on novel algorithms which work on a specific mix of resource, information source, task and test dataset (cf. Sect. 7). In the case of NED, the evaluation of the disambiguation component is confounded by interactions with mention spotting and candidate generation. With very few exceptions, there is little analysis of components and alternatives, and it is very difficult to learn any insight beyond the fact that the mix under study attained a certain performance on the target dataset (see [Hachey et al.2012] and [García et al.2014] for two exceptions on NED: the first is limited to a single dataset, and the second explores methods based on direct links, which we extend to the full graph). The number of algorithms and datasets is growing by the day, with no well-established single benchmark, and the fact that some systems are developed on test data, coupled with reproducibility problems [Fokkens et al.2013, on word relatedness], makes it very difficult to know where the area stands. There is a need for clear points of reference which allow us to understand where each information source and algorithm stands with respect to the alternatives.

We thus depart from previous work, seeking to set such a point of reference, and focus on a single knowledge source (hyperlinks in Wikipedia) with a clear research objective: given a well-established random walk algorithm (Personalized PageRank [Haveliwala2002]) we explore sources of links and filtering methods, and contrast the use of the full graph with the use of just direct links. We follow a clear development/test/analysis methodology, evaluating on an extensive range of both relatedness and NED datasets. The results are confirmed in both tasks, lending more support to the findings in this research. All software and data are publicly available, with instructions to obtain out-of-the-box replicability.

The contributions of our research are the following: (1) We show for the first time that performing random walks over the full graph is preferable to considering only direct links. (2) We study several sources of links, showing that non-reciprocal links hurt and that the contribution of the category structure and links in infoboxes is residual. (3) We set the new state of the art for systems based on Wikipedia links for both word relatedness and named-entity disambiguation. The results are close to the best systems to date, which use several information sources and/or supervised machine learning techniques, and specialize in either relatedness or disambiguation. Our work shows that a careful analysis of varieties of graphs using a well-known random walk algorithm pays off more than most ad-hoc algorithms.

The article is structured as follows. We first present previous work, followed by the different options to build hyperlink graphs. Sect. 4 reviews random walks for relatedness and NED. Sect. 5 sets the experimental methodology, followed by the analysis and results on development data (Sect. 6) and the comparison to the state of the art (Sect. 7). Finally, Sect. 8 draws the conclusions.

2 Previous work

The emergence of Wikipedia has opened up enormous opportunities for natural language processing [Hovy et al.2013], with many derived knowledge bases, including DBpedia [Bizer et al.2009], Freebase [Bollacker et al.2008], and BabelNet [Navigli and Ponzetto2012a], to name a few. These resources have been successfully used on semantic processing tasks like word relatedness, named-entity disambiguation (NED), also known as entity linking, and the closely related Wikification. Broadly speaking, Wikipedia-based approaches to those tasks can be split between those using the text in the articles (e.g., Gabrilovich and Markovitch, 2007) and those using the links between articles (e.g., Guo et al., 2011).

Relatedness systems take two words and return a high number if the two words are similar or closely related (e.g. professor - student), and a low number otherwise (e.g. professor - cucumber). Relatedness is more general than similarity; for the sake of simplicity, we will talk about relatedness in this paper. Evaluation is performed by comparing the returned values to those given by humans [Rubenstein and Goodenough1965].

In NED [Hachey et al.2012] the input is a mention of a named-entity in context and the output is the appropriate instance from Wikipedia, DBpedia or Freebase (cf. Figure 1). Wikification is similar [Mihalcea and Csomai2007], but target terms include common nouns and only relevant terms are disambiguated. Note that the disambiguation component in Wikification and NED can be the same.

Our work focuses on relatedness and NED. We favored NED over Wikification because of the larger number of systems and evaluation datasets, but our conclusions are applicable to Wikification, as well as other Wikipedia-derived resources.

In this section we will focus on previous work using Wikipedia links for relatedness, NED and Wikification. Although relatedness and disambiguation are closely related (relatedness to context terms is an important disambiguation clue for NED), most of the systems are evaluated in either relatedness or NED, with few exceptions, like WikiMiner [Milne and Witten2013], KORE [Hoffart et al.2012] and the one presented in this paper.

Milne and Witten (2008a) were the first to use hyperlinks between articles for relatedness. They compare two articles according to the number of incoming links that they have in common (i.e. overlap of direct links) based on Normalized Google Distance (NGD), combined with several heuristics and collocation strength. In later work [Milne and Witten2013], they incorporated machine learning. The authors also apply their technique to NED [Milne and Witten2008b], using their relatedness measures to train a supervised classifier. Unfortunately they do not present results of their link-based method alone, so we decided to reimplement it (cf. Sect. 6). We show that, under the same conditions, using the full graph is more effective in both tasks. We also run their out-of-the-box system on the same datasets as ours (cf. Sect. 7), with results below ours.
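For reference, the link-overlap measure described above is usually written along the following lines. This is a minimal sketch of the common NGD-style formulation over incoming links, not the reimplementation evaluated in this paper; the function name, the toy link sets and the clipping of the score to the range [0, 1] are assumptions made for illustration.

```python
import math


def ngd_relatedness(inlinks_a, inlinks_b, total_articles):
    """NGD-style relatedness of two articles from their sets of incoming links.

    Assumes total_articles is larger than either link set.
    """
    a, b = set(inlinks_a), set(inlinks_b)
    common = a & b
    if not common:
        return 0.0  # no shared incoming links: treat the pair as unrelated
    distance = (math.log(max(len(a), len(b))) - math.log(len(common))) / (
        math.log(total_articles) - math.log(min(len(a), len(b)))
    )
    # Turn the distance into a relatedness score, clipped to [0, 1]
    return min(1.0, max(0.0, 1.0 - distance))


# Toy example: hypothetical article ids stand in for incoming links
partial = ngd_relatedness({1, 2, 3, 4}, {3, 4, 5}, total_articles=1000)
identical = ngd_relatedness({1, 2}, {1, 2}, total_articles=1000)
disjoint = ngd_relatedness({1, 2}, {3, 4}, total_articles=1000)
```

Only articles that share at least one incoming link receive a non-zero score, which is why a measure of this kind is limited to direct links.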

Apart from hyperlinks between articles, other works on relatedness use the category structure [Strube and Ponzetto2006, Ponzetto and Strube2007, Ponzetto and Strube2011] to run path-based relatedness algorithms which had been successful on WordNet [Pedersen et al.2004], or use relations in infoboxes [Nastase and Strube2013]. In all cases, they obtain performance figures well below hyperlink-based systems (cf. Sect. 7). We will explore the contribution of such relations (cf. Sect. 3), incorporating them into the hyperlink graph.

Attempts to use the whole graph of hyperlinks for relatedness have been reported before. Yeh et al. (2009) obtained very low results on relatedness using an algorithm based on random walks similar to ours. Similar in spirit, Yazdani and Popescu-Belis (2013) built a graph derived from the Freebase Wikipedia Extraction dataset, which is derived from, but richer than, Wikipedia. Even if they mix hyperlinks with textual similarity, their results are lower than ours. One of the key differences with these systems is that we remove non-reciprocal links (cf. Sect. 3).

Regarding link-based methods for NED, there is only one system which relies exclusively on hyperlinks. Guo et al. (2011) use direct hyperlinks between the target entity and the mentions in the context, counting the number of such links. We show that the use of the full graph produces better results.

The rest of the NED systems present complex combinations. Lehmann et al. (2010) present a supervised system combining features based on hyperlinks, categories, text similarity and relations from infoboxes. Despite their complex and rich system, we will show that they perform worse than our system. Hachey et al. (2011) explored hyperlinks beyond direct links for NED, building subgraphs for each context using paths of length two departing from the context terms, combined with text-based relatedness. We will show that the full graph is more effective than limiting the distance to two, and report better results than their system. Several authors have included direct links using the aforementioned NGD in their combined systems [Ratinov et al.2011, Hoffart et al.2011]. Unfortunately, they do not report separate results for the NGD component. In very recent work, García et al. (2014) compare NGD with several other algorithms using direct links, but do not explore the full graph or try to characterize links. We will see that their results are well below ours (cf. Sect. 7).

Graph-based algorithms for relatedness and disambiguation have been successfully used on other resources, particularly WordNet. Hughes and Ramage (2007) were the first to present a random walk algorithm over the WordNet graph. Agirre et al. (2010) improved over their results using a similar random walk algorithm on several variations of WordNet relations, reporting the best results to date among WordNet-based algorithms. The same algorithm was used for word sense disambiguation [Agirre et al.2014], also reporting state-of-the-art results. We use the same open source software in our experiments. As an alternative to random walks, Tsatsaronis et al. (2010) use a path-based system over the WordNet relation graph.

In more recent work [Navigli and Ponzetto2012b, Pilehvar et al.2013], the authors present two relatedness algorithms for BabelNet, an enriched version of WordNet including articles from Wikipedia, hyperlinks and cross-lingual relations from non-English Wikipedias. In related work, Moro et al. (2014) present a multi-step NED algorithm on BabelNet, building semantic graphs for each context. We will show that Wikipedia hyperlinks alone are able to provide similar performance on both tasks.

3 Building Wikipedia Graphs

Wikipedia pages can be classified into main articles, category pages, redirects and disambiguation pages. Given a Wikipedia dump (a snapshot from April 4, 2013), we mine links between articles, between articles and category pages, as well as the links between category pages (the category structure). Our graphs include a directed edge from one article to another iff the text of the first article contains a hyperlink to the second article. In addition, we also include hyperlinks in infoboxes.

The graph contains two types of nodes (articles and categories) and three types of directed edges: hyperlinks from article to article (H), infobox links from article to article (I), links from article to category and links from category to category (C).

We constructed several graphs using different combinations of nodes and edges. In addition to the directed versions (d) we also constructed an undirected version (u), and a reduced graph which only contains links which are reciprocal (r), that is, we add a pair of edges between a1 and a2 if and only if there exists a hyperlink from a1 to a2 and from a2 to a1. Reciprocal links capture the intuition that both articles are relevant to each other, and tackle issues with links to low relevance articles, e.g. links to articles on specific years like 1984. Some authors weight links according to their relevance [Milne and Witten2013]. Our heuristic to keep only reciprocal links can be seen as a simpler, yet effective, method to avoid low relevance links.
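The directed, undirected and reciprocal variants can be sketched in a few lines. This is an illustrative sketch only: the edge representation, the function names and the toy links are assumptions, and dropping self-links is a choice made here rather than something stated in the text.

```python
from typing import Iterable, Set, Tuple

Edge = Tuple[str, str]  # (source article, target article)


def reciprocal_edges(directed: Iterable[Edge]) -> Set[Edge]:
    """Keep an edge (a, b) only if the reverse edge (b, a) also exists."""
    edges = set(directed)
    return {(a, b) for (a, b) in edges if a != b and (b, a) in edges}


def undirected_edges(directed: Iterable[Edge]) -> Set[Edge]:
    """Model every link as a pair of directed edges, one per direction."""
    edges = set(directed)
    return edges | {(b, a) for (a, b) in edges}


# Toy links (hypothetical): the link to the year page is not reciprocated
links = [
    ("Highveld_Lions", "Alan_Kourie"),
    ("Alan_Kourie", "Highveld_Lions"),
    ("Alan_Kourie", "1984"),
]
hr = reciprocal_edges(links)
hu = undirected_edges(links)
```

In this toy input the reciprocal graph keeps the two edges between the team and the player and discards the one-way link to the year page, which is the kind of low relevance link the heuristic is meant to filter.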

Graph Edges Nodes RG TAC09
Cd 18,803K 4,873K 51.1 49.5
Cu 37,598K 4,873K 72.9 65.5
Id 6,572K 1,860K 43.1 57.0
Iu 12,692K 1,860K 52.8 65.5
Hd 90,674K 4,103K 75.1 65.0
Hu 165,258K 4,103K 76.6 66.0
Hr 16,338K 2,955K 88.4 68.5
HrCu 53,005K 4,898K 78.2 67.5
HrIu 26,394K 3,273K 82.9 68.0
HrCuIu 63,184K 4,900K 75.6 67.5
Table 1: Statistics for selected graphs and results on development data for relatedness (RG, Spearman) and NED (TAC09, accuracy) with default parameters (see text). See Sect. 4.1 for abbreviations. for stat. significant differences with Hr in either RG or TAC09. for stat. signif. when comparing on all relatedness or NED datasets.

Table 1 gives the number of nodes and edges in some selected graphs. The graph with the fewest edges is the one with reciprocal hyperlinks, Hr, and the graphs with most edges are those with undirected edges, as each edge is modeled as two directed edges (this was done in order to combine undirected and reciprocal edges, and could be avoided in other cases). The number of nodes is similar in all graphs, except for the infobox graphs (infoboxes are only available for a few articles) and the reciprocal graph Hr, as relatively few nodes have reciprocal edges.

Article Freq. Prob.
Gotham_City 32 0.38
Gotham_(magazine) 15 0.18
New_York_City 1 0.01
Gotham_Records 1 0.01
Table 2: Partial view of dictionary entry for “gotham”. The probability is calculated as the ratio between the frequency and the total count.

3.1 Building the dictionary

In order to link running text to the articles in the graph, we use a dictionary, i.e., a static association between string mentions and all possible articles each mention can refer to.

We built our dictionary from the same Wikipedia dump, using article titles, redirections, disambiguation pages, and anchor text. Mention strings are lowercased and all text between parentheses is removed. If an anchor links to a disambiguation page, the text is associated with all possible articles the disambiguation page points to. Each association between a mention and an article is scored with the prior probability, estimated as the number of times that the mention occurs as an anchor linking to the article divided by the total number of occurrences of the mention as an anchor. Note that our dictionary can disambiguate any mention on its own, by just returning the highest-scoring article. Table 2 partially shows a sample entry in our dictionary.
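The prior estimation can be illustrated with a small sketch. It covers anchor text only (titles, redirections and disambiguation pages are left out), and the function names and toy anchor counts are assumptions for illustration, not the actual dictionary builder.

```python
import re
from collections import Counter, defaultdict


def normalize(mention: str) -> str:
    """Lowercase the mention and remove any text between parentheses."""
    return re.sub(r"\s*\([^)]*\)", "", mention).strip().lower()


def build_dictionary(anchors):
    """anchors: iterable of (anchor text, target article) pairs."""
    counts = defaultdict(Counter)
    for text, article in anchors:
        counts[normalize(text)][article] += 1
    dictionary = {}
    for mention, per_article in counts.items():
        total = sum(per_article.values())
        # Prior = anchor count for this article / all anchor counts of the mention
        dictionary[mention] = {a: c / total for a, c in per_article.items()}
    return dictionary


# Toy anchor counts (hypothetical, much smaller than the real entry in Table 2)
toy_anchors = [("Gotham", "Gotham_City")] * 3 + [("Gotham (magazine)", "Gotham_(magazine)")]
toy_dict = build_dictionary(toy_anchors)
best = max(toy_dict["gotham"], key=toy_dict["gotham"].get)
```

Returning the highest-scoring article, as in the last line, corresponds to disambiguating with the dictionary alone.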

Drink Alcohol
Drink .124 Alcohol .145
Alcoholic_beverage .036 Alcoholic_beverage .026
Drinking .028 Ethanol .018
Coffee .020 Alkene .006
Tea .017 Alcoholism .006
Table 3: Sample of the probability distribution returned by ppr for two words. Top five articles shown.

4 Random Walks

The PageRank random walk algorithm [Brin and Page1998] is a method for ranking the vertices in a graph according to their relative structural importance. PageRank can be viewed as the result of a random walk process, where the final rank of node i represents the probability of a random walk over the graph ending on node i, at a sufficiently large time.

Personalized PageRank (ppr) is a variation of PageRank [Haveliwala2002], where the query of the user defines the importance of each node, biasing the resulting PageRank score to prefer nodes in the vicinity of the query nodes. The query bias is also called the teleport vector.

ppr has been successfully used on the WordNet graph for relatedness [Hughes and Ramage2007, Agirre et al.2010] and WSD [Agirre and Soroa2009, Agirre et al.2014]. In our experiments we use UKB version 2.1, an open source software package for relatedness and disambiguation based on ppr. For the sake of space, we will skip the details, and refer the reader to those papers. ppr has two parameters: the number of iterations, and the damping factor, which controls the relative weight of the teleport vector.
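As a rough illustration of the algorithm, the following power-iteration sketch computes a Personalized PageRank vector for a small graph. It is not the UKB implementation: the data structures, the function name and the handling of nodes without outgoing links (their mass is returned to the teleport vector) are assumptions, and the defaults simply mirror the 30 iterations and 0.85 damping factor used in this paper.

```python
def personalized_pagerank(out_links, teleport, damping=0.85, iterations=30):
    """out_links: node -> list of successors; teleport: node -> probability (sums to 1)."""
    nodes = set(out_links) | set(teleport)
    for targets in out_links.values():
        nodes.update(targets)
    rank = {n: teleport.get(n, 0.0) for n in nodes}
    for _ in range(iterations):
        # With probability (1 - damping) the walker jumps back to the teleport vector
        new_rank = {n: (1.0 - damping) * teleport.get(n, 0.0) for n in nodes}
        for n in nodes:
            targets = out_links.get(n, [])
            if targets:
                share = damping * rank[n] / len(targets)
                for m in targets:
                    new_rank[m] += share
            else:
                # Assumed convention: a node without out-links returns its mass to the seeds
                for m, p in teleport.items():
                    new_rank[m] += damping * rank[n] * p
        rank = new_rank
    return rank


# Toy undirected chain a - b - c - d, with all teleport mass on "a"
chain = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
ppv = personalized_pagerank(chain, {"a": 1.0})
```

The resulting vector sums to one, and nodes further away from the seed receive progressively less probability mass.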

4.1 Random walks on Wikipedia

Given a dictionary and graph derived from Wikipedia (cf. Sect. 3), ppr expects a set of mentions, i.e., a set of strings which can be linked to Wikipedia articles via the dictionary. The method first initializes the teleport vector: for each mention in the input, the articles in the respective dictionary entry are set with an initial probability, and the rest of articles are set to zero. We explored two options to set the initial probability of each article: the uniform probability or the prior probability in the dictionary. When an article appears in the dictionary entry for two mentions, the initial probability is summed up. In a second step, we apply ppr for a number of iterations, producing a probability distribution over Wikipedia articles in the form of a ppr vector (ppv).

The probability vector can be used for both relatedness and NED. For relatedness we produce a ppv vector for each of the words to be compared, using the single word as input mention. The relatedness between the target words is computed as the cosine between the respective ppv vectors. In order to speed up the computation, we can reduce the size of the ppv vectors, setting to zero all values below rank k after ordering the values in decreasing order.
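The relatedness computation can be sketched as a cosine over sparse vectors. The sketch below is illustrative: the function names are assumptions, and the two toy vectors contain only the top five values listed in Table 3, so the resulting score is not the one the full system would produce.

```python
import math


def truncate(ppv, k):
    """Keep the k highest values of a ppv (article -> probability); the rest count as zero."""
    top = sorted(ppv.items(), key=lambda item: item[1], reverse=True)[:k]
    return dict(top)


def cosine(u, v):
    """Cosine between two sparse vectors stored as dicts."""
    dot = sum(p * v.get(article, 0.0) for article, p in u.items())
    norm_u = math.sqrt(sum(p * p for p in u.values()))
    norm_v = math.sqrt(sum(p * p for p in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Top five values for each word, copied from Table 3
drink = {"Drink": 0.124, "Alcoholic_beverage": 0.036, "Drinking": 0.028,
         "Coffee": 0.020, "Tea": 0.017}
alcohol = {"Alcohol": 0.145, "Alcoholic_beverage": 0.026, "Ethanol": 0.018,
           "Alkene": 0.006, "Alcoholism": 0.006}
relatedness = cosine(truncate(drink, 5), truncate(alcohol, 5))
```

Because the two truncated vectors share the article Alcoholic_beverage, the cosine is greater than zero.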

Table 3 shows the top 5 articles in the ppv vectors of two sample words. The relatedness between pairs Drink and Alcohol would be non-zero, as their respective vectors contain common articles.

For NED the input comprises the target entity mention and its context, defined as the set of mentions occurring within a 101 token window centered on the target. In order to extract mentions of Wikipedia articles from the context, we match the longest strings in our dictionary as we scan tokens from left to right. We then initialize the teleport probability with all articles referred to by the mentions. After computing Personalized PageRank, we output the article with the highest rank in ppv among the possible articles for the target entity mention. Figure 1 shows an example of NED.

If the prior is being used to initialize weights, we multiply the prior probability with the PageRank probabilities before computing the final ranks. In the rare cases (less than 3% of instances) where no known mention is found in the context, we return the node with the highest prior.
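The final ranking step for NED can be sketched as follows. The function name and the prior values are hypothetical, the two ppr scores are the ones quoted for the Lions example in Sect. 6, and treating an empty ppv as the "no known mention in the context" case is an assumption made to keep the sketch short.

```python
def disambiguate(candidates, ppv, use_prior=True):
    """candidates: article -> prior for the target mention; ppv: article -> ppr probability."""
    if not ppv:
        # Fallback: nothing to walk from, so return the article with the highest prior
        return max(candidates, key=candidates.get)

    def score(article):
        rank = ppv.get(article, 0.0)
        return rank * candidates[article] if use_prior else rank

    return max(candidates, key=score)


# Hypothetical priors for the mention "Lions"; ppr scores as quoted for Figure 1
priors = {"B&I_Lions": 0.6, "Highveld_Lions": 0.4}
ppr_scores = {"B&I_Lions": 0.05, "Highveld_Lions": 0.75}
choice = disambiguate(priors, ppr_scores)
fallback = disambiguate(priors, {})
```

With these toy numbers the ppr evidence outweighs the prior, so Highveld_Lions is selected, while the fallback returns the article with the highest prior.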

  1. Graphs in Table 1 (default: Hr)

  2. Number of iterations in PageRank (default: 30)

  3. Damping factor in PageRank (default: 0.85)

  4. Initializing with prior or not (P or ¬P) (default: P)

  5. Relatedness: number of values in ppv, k (default: 5000)

Figure 2: Summary of variants and parameters as well as the default values for each of them.

Note that our NED and relatedness algorithms are related. NED is using relatedness, as PageRank probabilities capture how related each candidate article is to the context of the mention. Following the first-order and second-order co-occurrence abstraction [Islam and Inkpen2006, Agirre and Edmonds2007, Ch. 6], we can interpret that we do NED using first-order relatedness, while our relatedness uses second-order relatedness.

Figure 2 summarizes all parameters mentioned so far, as well as their default values, which were set following previous work [Agirre et al.2010, Agirre et al.2014].

5 Experimental methodology

We summarize the datasets used in Table 4. RG, MC and 353 are the most used relatedness datasets to date, with TSA and KORE being more recent datasets where some top-ranking systems have been evaluated. Word relatedness datasets were lemmatized and lowercased, except for KORE, which is an entity relatedness dataset where the input comprises article titles (we had to manually adjust the articles in KORE, as the exact title depends on the Wikipedia version; we missed 3 for our 2013 version, which could slightly degrade our results). Following common practice, rank correlation (Spearman) was used for evaluation.
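For readers unfamiliar with the evaluation measure, Spearman correlation is the Pearson correlation computed over ranks. The sketch below is a plain-Python illustration with average ranks for ties; the function names and the toy scores are assumptions, and it presumes that neither input is constant.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for pos in range(i, j + 1):
            ranks[order[pos]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation between two equally long score lists."""
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    dev_x = math.sqrt(sum((a - mean_x) ** 2 for a in rx))
    dev_y = math.sqrt(sum((b - mean_y) ** 2 for b in ry))
    return cov / (dev_x * dev_y)


# Toy data (hypothetical): human judgements vs. system scores for four word pairs
human = [3.9, 0.1, 2.5, 1.0]
system = [0.80, 0.05, 0.40, 0.30]
rho = spearman(human, system)
```

Only the ordering matters: the toy system scores are on a different scale from the human judgements but rank the pairs identically, so the correlation is maximal.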

Name Reference #
RG [Rubenstein and Goodenough1965] 65
MC [Miller and Charles1991] 30
353 [Gabrilovich and Markovitch2007] 353
TSA [Radinsky et al.2011] 287
KORE [Hoffart et al.2012] 420
TAC09 [McNamee et al.2010] 1675
TAC10 1020
TAC13 1183
AIDA [Hoffart et al.2011] 4401
KORE [Hoffart et al.2012] 143
Table 4: Summary of relatedness (top) and NED (bottom) datasets. Rightmost column for number of instances.

Regarding NED, the TAC Entity Linking competition is held annually. Due to its popularity it is useful to set the state of the art. We selected the datasets from 2009 and 2010, as they have been used to evaluate several top ranking systems, as well as the 2013 dataset, which is the most recent. In addition, we also provide results for AIDA, the largest and only dataset providing annotations for all entities in the documents, and KORE, a recent, very small dataset focusing on difficult mentions and short contexts. Evaluation was performed using accuracy, the ratio between correctly disambiguated instances and the total number of instances that have a link to an entity in the knowledge base (this corresponds to non-NIL accuracy at TAC-KBP, also called KB accuracy, and to Micro P@1.0 in [Hoffart et al.2011]). Each dataset uses a different Wikipedia version, but fortunately Wikipedia keeps redirects from older article titles to the new version. As customary in the task, we automatically map the articles returned by our system to the version used in the gold standard.
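The accuracy measure defined above can be written down directly. This is a minimal sketch: the function name, the use of None for instances without a knowledge base entry and the toy instances are assumptions, and the mapping between Wikipedia versions is left out.

```python
def kb_accuracy(predictions, gold):
    """Accuracy over the instances whose gold answer is a knowledge base entry.

    predictions, gold: dicts from instance id to article title;
    gold is None for instances without an entry (NIL), which are ignored.
    """
    linked = [i for i, answer in gold.items() if answer is not None]
    if not linked:
        return 0.0
    correct = sum(1 for i in linked if predictions.get(i) == gold[i])
    return correct / len(linked)


# Toy instances (hypothetical ids and answers)
gold = {"q1": "Highveld_Lions", "q2": "Cape_Town", "q3": None}
pred = {"q1": "Highveld_Lions", "q2": "Cape_Town_International_Airport", "q3": "Duncan_Fletcher"}
score = kb_accuracy(pred, gold)
```

The third instance has no entry in the knowledge base, so it does not count towards the denominator and the toy system scores one correct answer out of two.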

Following standard practice in NED, we do not evaluate mention detection (see [Cornolti et al.2013] for a framework to evaluate both mention detection and disambiguation), that is, the datasets already specify which are the target mentions. Note that TAC provides so-called “queries” which can be substrings of the full mention, e.g. “Smith” for a mention like “John Smith”. Given a mention, we devised the following heuristics to improve candidate generation: (1) remove any substring contained in parentheses from the mention, then check the dictionary, (2) if not found, remove “the” if it is the first token in the mention, then check the dictionary, (3) if not found, remove the middle token if the mention contains three tokens, then check the dictionary, (4) if not found, search for a matching entity using the Wikipedia API. The heuristics provide an improvement of around 4 points on development. Later analysis showed that these heuristics seem to be only relevant on the TAC datasets, because of the way the query strings are designed, but not on AIDA or KORE.
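The cascade of candidate generation heuristics can be sketched as below. Several details are assumptions: the steps are applied cumulatively, the dictionary is a plain dict keyed by lowercased mentions, and `api_search` is a hypothetical callable standing in for the Wikipedia API lookup, which is not reproduced here.

```python
import re


def generate_candidates(mention, dictionary, api_search=None):
    """Return the dictionary entry for a mention, applying the fallback heuristics in order."""
    # (1) Remove any parenthesised substring, then check the dictionary
    text = re.sub(r"\s*\([^)]*\)", "", mention).strip().lower()
    if text in dictionary:
        return dictionary[text]
    # (2) Remove a leading "the", then check again
    tokens = text.split()
    if tokens and tokens[0] == "the":
        tokens = tokens[1:]
        text = " ".join(tokens)
        if text in dictionary:
            return dictionary[text]
    # (3) Remove the middle token of a three-token mention, then check again
    if len(tokens) == 3:
        text = " ".join([tokens[0], tokens[2]])
        if text in dictionary:
            return dictionary[text]
    # (4) Last resort: external search (hypothetical callable supplied by the caller)
    if api_search is not None:
        return api_search(mention)
    return None


# Toy dictionary (hypothetical entries and priors)
toy = {"john smith": {"John_Smith": 1.0}, "lions": {"Highveld_Lions": 0.4, "B&I_Lions": 0.6}}
by_middle_token = generate_candidates("John Q. Smith", toy)
by_article = generate_candidates("The Lions (franchise)", toy)
missing = generate_candidates("Unknown Entity", toy)
```

Each step is tried only when the previous one fails, so mentions already present in the dictionary are never altered.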

5.1 Development and test

We wanted to follow a standard experimental design, with a clear development/test split for each task. Unfortunately there is no standard split in the literature, and the choice is difficult: the development dataset should be representative enough to draw conclusions on different alternatives and parameters, but at the same time the most relevant datasets in the literature should be left for testing, in order to have enough points for comparison. In addition, some recent algorithms supposedly setting the state of the art are only tested on newly produced datasets. Note also that relatedness datasets are small, making it difficult to find statistically significant differences.

In order to strike a balance between the need for in-depth analysis and fair comparison to previous results, we decided to focus on the two oldest datasets from each task for development and analysis: RG for relatedness and a subset of 200 polysemic instances from TAC09 for NED (TAC09). The released dataset includes this subset. The rest will be used for test, where the parameters have been set on development. Given the need for significant conclusions, we re-checked the main conclusions drawn from development data using the aggregation of all test datasets, but only after the comparison to the state of the art had been performed. This way we ensure both a fair comparison with the state of the art and a well-grounded analysis.

We performed significance tests using Fisher’s z-transformation for relatedness [Press et al.2002, equation 14.5.10], and paired bootstrap resampling for NED [Noreen1989], accepting differences whose p-value falls below the significance threshold. Given the small size of the datasets, when necessary, we also report statistical significance when joining all datasets as just mentioned.
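A test of this kind for two correlation coefficients can be sketched as follows. This is an illustration of the standard Fisher z-transformation test for two independent correlations, which may differ in detail from the procedure used in the experiments; the function names are assumptions, and the example call reuses the RG figures for Hr and Hd from Table 1 together with the 65 pairs of RG purely to show usage.

```python
import math


def fisher_z(r):
    """Fisher's z-transformation of a correlation coefficient (equal to atanh(r))."""
    return 0.5 * math.log((1.0 + r) / (1.0 - r))


def correlation_difference_pvalue(r1, n1, r2, n2):
    """Two-sided p-value for the difference between two correlation coefficients."""
    z_diff = abs(fisher_z(r1) - fisher_z(r2))
    std_err = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    return math.erfc(z_diff / (math.sqrt(2.0) * std_err))


# Usage example: correlations of 0.884 and 0.751, each measured on 65 word pairs
p_value = correlation_difference_pvalue(0.884, 65, 0.751, 65)
p_same = correlation_difference_pvalue(0.8, 65, 0.8, 65)
```

The standard error depends only on the dataset sizes, which is why differences are hard to establish on datasets as small as RG or MC.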

6 Studying the graph and parameters

In this section we study the performance of the different graphs and parameters on the two development datasets, RG and TAC09. The next section reports the results on the test sets for the best parameters, alongside state-of-the-art system results.

Graph Param. RG Param. TAC09
Hr default 88.4 default 68.5
Hr ¬P 87.0 ¬P 49.0
Hr 88.4 68.5
Hr 88.4 68.5
Hr 88.4
Table 5: Parameters: Summary of results on development data for relatedness (RG, Spearman correlation) and NED (TAC09, accuracy) for several parameters using Hr graph. Parameters are set to default values (see text) except for the one noted explicitly. for statistical significant differences with respect to default.

As mentioned in Sect. 4.1, ppr has several parameters and variants (cf. Figure 2). We first checked exhaustively all possible combinations for different graphs, with the rest of parameters set to default values. We then optimized each of the parameters in turn, seeking to answer the following questions:

Which links help most? Table 1 shows the results for selected graphs. The first seven rows present the results for each edge source in isolation, both using directed and undirected edges. Categories and infoboxes suffer from producing smaller graphs, with the hyperlinks yielding the best results. The undirected versions improve over directed links in all cases, with the use of reciprocal edges for hyperlinks obtaining the best results overall (the graphs with reciprocal edges for categories and infoboxes were too small and we omit them). The trend is the same in both relatedness and NED, highlighting the robustness of these results.

Regarding combined graphs, we report the most significant combinations. The reciprocal graph of hyperlinks outperforms all combinations (including the combinations which were omitted), showing that categories and infoboxes do not help or even degrade slightly the results. The differences are statistically significant (either on the individual datasets or in the aggregation on all datasets) in all cases, confirming that Hr is significantly better.

The degradation or lack of improvement when using infoboxes is surprising. We hypothesized that it could be caused by non-reciprocal links in HrIu. In fact, removing non-reciprocal links from HrIu improved results slightly on NED, matching those of Hr. This lack of improvement with infoboxes, even when removing non-reciprocal links, can be explained by the fact that only 5% of reciprocal links in Iu are not in Hr. It seems that this additional 5% is not helping in this particular dataset. Regarding categories, the category structure is mostly a tree, which is a structure where random walks do not seem to be effective, as already observed in [Agirre et al.2014] for WordNet.

Is initialization of random walks important? The second row in Table 5 reports the result when using uniform distributions to initialize the random walks (instead of prior probabilities). The results degrade in both datasets, the difference being significant only for NED. This was later confirmed in the rest of the relatedness and NED datasets: using prior probabilities for initialization improves results in all cases, but the improvement is only significant in NED datasets. These results show that relatedness is less sensitive to changes in the distribution of meanings, that is, using the more informative prior distributions of meaning only improves results slightly. NED, on the contrary, is more sensitive, as the distribution of senses dramatically affects performance.

Are the damping factor and the number of iterations important? The best damping factor on both datasets was the default value (cf. Table 5), in agreement with related work using WordNet [Agirre et al.2010]. The lowest numbers of iterations at which convergence was obtained were 30 and 15, respectively, although as few as 5 iterations yielded very similar performance (87.1 on relatedness, 68.0 on NED).

Is the size of the vector, k, important for relatedness? The best performance was attained for the default k, with minor variations for a range of other values.

Graph Method RG TAC09
Hr NGD 81.8 57.5
Hr ppr (1 iter.) 43.4 60.5
Hr ppr (2 iter.) 78.3 66.0
Hr ppr default 88.4 68.5
Table 6: Result when using single links, compared to the use of the full graph on development data. We reimplemented NGD. for stat. signif. difference with ppr. for stat. signif. using all datasets.
Graph Method Year RG TAC09
Hr ppr default 2010 86.3 68.5
Hr ppr default 2011 85.6 70.5
Hr ppr default 2013 88.4 68.5
Table 7: ppr using different Wikipedia versions

Is the full graph helping? When the ppr algorithm does a single iteration, we can interpret that it is ranking all entities using direct links. When doing two iterations, we can loosely say that it is using links at distance two, and so on. Table 6 shows that ppr is able to profit from the full graph well beyond 2 iterations, especially in relatedness. These results were confirmed in the full set of datasets, with statistically significant differences in all cases.
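The link between the number of iterations and the distance reached in the graph can be checked on a toy example. The sketch below is an illustration under stated assumptions, not the UKB code: it uses a compact Personalized PageRank with a single seed on an undirected graph in which every node has at least one neighbour, and the node names are hypothetical.

```python
def ppr(neighbours, seed, damping=0.85, iterations=30):
    """Personalized PageRank with a single seed; neighbours: node -> list of adjacent nodes."""
    rank = {n: 1.0 if n == seed else 0.0 for n in neighbours}
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) if n == seed else 0.0 for n in neighbours}
        for n, targets in neighbours.items():
            for m in targets:
                new_rank[m] += damping * rank[n] / len(targets)
        rank = new_rank
    return rank


# Toy chain: seed - direct - two_hops - three_hops
chain = {
    "seed": ["direct"],
    "direct": ["seed", "two_hops"],
    "two_hops": ["direct", "three_hops"],
    "three_hops": ["two_hops"],
}
one_iter = ppr(chain, "seed", iterations=1)
two_iter = ppr(chain, "seed", iterations=2)
full = ppr(chain, "seed")
```

After one iteration only the direct neighbour of the seed has received probability mass, after two iterations the node at distance two is reached, and the default setting spreads mass over the whole chain.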

In addition, we reimplemented the relatedness and NED algorithms based on NGD over direct links [Milne and Witten2008a, Milne and Witten2008b], allowing us to compare them to ppr under the same experimental conditions. We first developed the relatedness algorithm. In order to replicate the NGD relatedness algorithm, we checked the open source code available, exploring the use of inlinks and outlinks and the use of maximum pairwise article relatedness. We also realized that the use of priors (“commonness” according to the terminology in the paper) was hurting, so we dropped it. We checked both reciprocal and unidirectional versions of the hyperlink graph, with better results for the reciprocal graph. Table 6 reports the best variant, which outperforms the 0.64 on RG reported in their paper. We followed a similar methodology for NED: we checked both reciprocal and undirected graphs (with similar results), the combination with the prior (similar results), weighting terms in the context (with improvement) and the use of ambiguous mentions in the context (marginal improvement). Reported results correspond to the reciprocal graph, combination with the prior, weighted terms and only monosemous mentions. Table 6 shows the results for NGD, which performs worse than ppr. This trend was confirmed on the full set of datasets for relatedness and NED with statistical significance in all cases except KORE, which is the smallest NED dataset. Figure 1 illustrates why the use of longer paths is beneficial. In fact, NGD returns 0.14 for B&I_Lions and 0.13 for Highveld_Lions, but ppr correctly returns 0.05 and 0.75, respectively.

System Source RG 353 TSA MC KORE
[Ponzetto and Strube2011] Wiki11 c 75.0*
[Nastase and Strube2013] Wiki13 ci 67.0
[Milne and Witten2013] Wiki13 la 69.5r 59.7r 35.8r 77.2r 65.9r
[Yeh et al.2009] Wiki09 g 48.5
ppr default Hr Wiki13 g 0 88.4* 1 72.8 1 64.1 1 81.0 1 66.2
[Agirre et al.2010] WNet g 1 86.2r 68.5 45.4r 3 85.2r
[Tsatsaronis et al.2010] WNet g 86.1 61.0
[Navigli and Ponzetto2012b] WNet+Wiki12 (cl) g+CL 65.0 1 90.0
[Pilehvar et al.2013] WNet+Wiki13 g 86.8*
ppr default Hr Wiki13 g 0 88.4* 2 72.8 1 64.1 4 81.0 1 66.2
ppr default Hr WNet+Wiki13 g 0 91.8* 1 78.5 2 62.9 2 87.6 1 66.2
[Gabrilovich and Markovitch2007] Wiki07 t 82.0 75.0 59.0 73.0
[Hoffart et al.2012] Wiki12 t 0 69.8*
[Yazdani and Popescu-Belis2013] Freebase gt 70.0*
[Radinsky et al.2011] Time C 1 80.0 1 63.0
[Baroni et al.2014] Corpus C 84.0* 71.0
[Agirre et al.2009] WNet+Corpus Cg+SUP 0 96.0x 78.0x
[Milne and Witten2013] Wiki13 la+SUP 83.5r 74.0x 52.8r 81.3r 1 66.5r
ppr default Hr WNet+Wiki13 g 0 91.8* 2 78.5 2 62.9 2 87.6 2 66.2
Table 8: Spearman results for relatedness systems. The source column includes codes for the information used (t for article text, l for direct hyperlinks, g for hyperlink graph, c for categories, i for infoboxes, a for anchor text) and for other information sources (CL for crosslingual links, C for corpora, SUP for supervised Machine Learning). The results include the following codes: * for the best reported result among several variants, x for a cross-validation result, r for a third-party system run by us. We also include the rank of our ppr system in each group of rows, including the systems above it (excluding * and x systems, which get rank 0 if they are top rank).
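
As a reminder of how the figures in Table 8 are obtained, the sketch below computes the Spearman rank correlation between gold relatedness judgements and system scores (the table reports it multiplied by 100). It is a plain-Python illustration with invented scores, not the evaluation script used for the paper.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(gold, predicted):
    """Spearman correlation: Pearson correlation computed over the ranks."""
    rg, rp = average_ranks(gold), average_ranks(predicted)
    n = len(rg)
    mean_g, mean_p = sum(rg) / n, sum(rp) / n
    cov = sum((x - mean_g) * (y - mean_p) for x, y in zip(rg, rp))
    var_g = sum((x - mean_g) ** 2 for x in rg)
    var_p = sum((y - mean_p) ** 2 for y in rp)
    return cov / (var_g * var_p) ** 0.5


# Invented gold judgements and system scores for four word pairs.
gold_scores = [3.9, 3.1, 0.4, 1.2]
system_scores = [0.9, 0.5, 0.2, 0.1]
rho = spearman(gold_scores, system_scores)
```

Only the ordering of the scores matters, so systems whose scores live on different scales can be compared directly.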
System Source TAC2009 TAC2010 TAC2013 AIDA KORE50
MFS baseline Wiki13 l 68.3 73.7 72.7 69.0 36.4
[Guo et al.2011] Wiki10 l 1 74.0 74.1
[Milne and Witten2013] Wiki13 la 57.4r 58.5r 37.1r 56.0r 35.7r
[García et al.2014] Wiki12 l 76.6
ppr default Hr Wiki13 g 0 78.8* 1 83.6 1 81.7 1 80.0 1 60.8
[Moro et al.2014] WNet+Wiki13 g+CL 1 82.1 1 71.5
ppr default Hr Wiki13 g 0 78.8* 1 83.6 1 81.7 2 80.0 2 60.8
[Bunescu and Pasca2006] Wiki11 tc 0 83.8ra* 68.4ra
[Cucerzan2007] Wiki11 tc 0 83.5ra* 78.4ra 51.0ro
[Hachey et al.2011] Wiki11 tcg 79.8*
[Hoffart et al.2012] Wiki12 t 0 81.8* 0 64.6*
[Hoffart et al.2011] Wiki11 tli+SUP 0 81.8*
[Milne and Witten2013] Wiki13 la+SUP 57.5r 63.4r 40.0r 55.6r 37.1r
Best TAC KBP system 1 76.5 80.6 77.7
ppr default Hr Wiki13 g 0 78.8* 1 83.6 1 81.7 2 80.0 2 60.8
Table 9: Accuracy of NED systems, using the same codes as in Table 8. Some early systems have been re-implemented and tested by others: ra for [Hachey et al.2012], ro for [Hoffart et al.2011]. We report the rank of our ppr system in each group of rows, including the systems above it (excluding * systems, which get rank 0 if they are top rank).

How important is the Wikipedia version? Table 7 shows that the versions we tested do not affect the results dramatically, and that using the latest version does not yield better results in NED. Perhaps the larger size and number of hyperlinks of newer versions only affect new and rare articles, but not the ones present in TAC09. We kept using the 2013 version for testing.

What is the efficiency of the algorithm? Initialization takes around 5 minutes (footnote 15), most of which (4m50s) is spent loading the dictionary into memory. Using a database instead, initialization takes 10s. Memory requirements for Hr were 4.7 GB, down to 1.1 GB when using the database. The main bottleneck of our system is the computation of Personalized PageRank, with each iteration taking around 0.60 seconds. We are currently examining fast approximations of PageRank, and plan to improve efficiency.

Footnote 15: Time measured on a single server with Xeon E7-4830 8-core processors, 2130 MHz, 64 GB RAM.

7 Comparison to related work

In the previous section we presented several results under the same experimental conditions. We now use the graph and parametrization which yield the best results on development (default parameters with Hr). Comparison to the state of the art is complicated by the fact that many systems report results on different datasets, which makes the tables in this section rather sparse. The comparison for relatedness is straightforward but, in NED, it is not possible to factor out the impact of the candidate generation step. Given that our candidate generation procedure is not particularly sophisticated, we do not think this is a decisive factor in favour of our results.

Tables 8 and 9 report the results of the best systems on both tasks. Given that several systems were developed on test data, we also report our results on RG and TAC2009, marking all such results (see the table captions for details). We split the results in both tables into three sets: top rows for systems using link and graph information alone, middle rows for link- and graph-based systems using WordNet and/or Wikipedia, and bottom rows for more complex systems. We repeat the results of our system in each set of rows, for easier comparison. Our main focus is on the top rows, which show the superiority of our results with respect to other systems using Wikipedia links and graphs. The middle and bottom rows show the relation to the state of the art.

For easier exposition, we will examine the results by row section, simultaneously on relatedness and NED. The top rows in Table 8 report four relatedness systems which have already been presented in Sect. 2, showing that our system is best in all five datasets. Note that the [Milne and Witten2013] row was obtained by running their publicly available system with the supervised Machine Learning component turned off (see below for the results using SUP). The top rows of Table 9 report the most frequent sense (MFS) baseline, as produced by our dictionary, and three link-based systems (cf. Sect. 2), showing that our method is best in all five datasets. These results show that the use of the full graph as devised in this paper is a winning strategy.
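
The baseline in the top rows of Table 9 can be pictured with the sketch below, which picks the candidate with the highest count for each mention and scores the output by accuracy. The dictionary layout, the counts and the function names are invented for illustration; the actual dictionary and candidate generation are described earlier in the paper.

```python
def mfs_baseline(dictionary, mention):
    """Return the most frequent candidate entity for a mention, or None.

    dictionary: dict mapping a mention string to a dict {entity: count}.
                This layout is an assumption made for this sketch.
    """
    candidates = dictionary.get(mention, {})
    if not candidates:
        return None  # mention not covered by the dictionary
    return max(candidates, key=candidates.get)


def accuracy(predicted, gold):
    """Fraction of mentions whose predicted entity matches the gold entity."""
    correct = sum(1 for p, g in zip(predicted, gold) if p == g)
    return correct / len(gold)


# Invented counts, loosely following the Lions example of Figure 1.
toy_dictionary = {
    "Lions": {"B&I_Lions": 120, "Highveld_Lions": 30},
    "Cape Town": {"Cape_Town": 500},
}
mentions = ["Lions", "Cape Town"]
gold_entities = ["Highveld_Lions", "Cape_Town"]

predictions = [mfs_baseline(toy_dictionary, m) for m in mentions]
baseline_accuracy = accuracy(predictions, gold_entities)
```

Because the baseline ignores the context, it always returns the same entity for a given mention, which is why it cannot recover the less frequent referent in this toy case.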

The relatedness results in the middle rows of Table 8 include several systems using WordNet and/or Wikipedia (cf. Sect. 2), including the system in [Agirre et al.2010], which we ran out-of-the-box with default values. To date, link-based systems using WordNet had reported stronger results than their counterparts on Wikipedia, but the table shows that our Wikipedia-based results are the strongest on all relatedness datasets but one (MC, the smallest dataset, with only 30 pairs). In addition, the table shows our results when combining random walks on Wikipedia and WordNet (footnote 16), which yields improvements in most datasets. In the counterpart for NED in Table 9, [Moro et al.2014] outperform our system, especially on the smaller KORE (143 instances), but note that they use a richer graph which combines WordNet, the English Wikipedia and hyperlinks from other language Wikipedias.

Footnote 16: We multiply the scores of ppr on Wikipedia and WordNet.
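
The combination of the two random walks amounts to a product of scores per word pair, which can be sketched as follows. The pair keys, the score values and the decision to skip pairs missing from either resource are assumptions of this sketch, not details given in the paper.

```python
def combine_scores(wiki_scores, wordnet_scores):
    """Multiply the Wikipedia and WordNet relatedness scores of each pair.

    Both arguments map a word pair to a relatedness score. Pairs missing
    from either resource are skipped here (an assumption of this sketch).
    """
    return {
        pair: wiki_scores[pair] * wordnet_scores[pair]
        for pair in wiki_scores
        if pair in wordnet_scores
    }


# Invented scores for two word pairs.
wiki = {("car", "automobile"): 0.9, ("car", "forest"): 0.2}
wordnet = {("car", "automobile"): 0.8, ("car", "forest"): 0.1}

combined = combine_scores(wiki, wordnet)
```

A product rewards pairs that both resources consider related and pushes down pairs that either resource scores low, so it can change the ranking of pairs, which is what the Spearman evaluation measures.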

Finally, the bottom rows in both tables report the best systems to date. For lack of space, we cannot review systems not using Wikipedia links. Regarding relatedness, we can see that our combination of WordNet and Wikipedia would rank second in all datasets, with only a single system (based on corpora) beating our system in more than one dataset [Radinsky et al.2011]. Regarding NED, our system ranks first on the TAC datasets, even when compared with the best systems that participated in the TAC competitions [Varma et al.2009, Lehmann et al.2010, Cucerzan and Sil2013], and second to [Moro et al.2014] on AIDA and KORE.

8 Conclusions and Future Work

This work departs from previous work based on Wikipedia and derived resources, as it focuses on a single knowledge source (links in Wikipedia) with a clear research objective: given a well-established random walk algorithm, we explored which sources of links and filtering methods are useful, contrasting the use of the full graph with the use of direct links alone. We follow a clear development/test/analysis methodology, evaluating on an extensive range of both relatedness and NED datasets. All software and data are publicly available, with instructions to obtain out-of-the-box replicability.

We show for the first time that random walks over the full graph of links improve over direct links. We studied several variations of sources of links, showing that non-reciprocal links hurt and that the contribution of the category structure and relations in infoboxes is residual. This paper sets a new state of the art for systems based on Wikipedia links on both word relatedness and named-entity disambiguation datasets. The results are close to those of the best combined systems, which specialize in either relatedness or disambiguation and use several information sources and/or supervised machine learning techniques. This work shows that a careful analysis of varieties of graphs using a well-known random walk algorithm pays off more than most ad-hoc algorithms proposed to date.

For the future, we would like to explore ways to identify the informative hyperlinks, perhaps weighting edges according to their relevance, and would also like to speed up the random-walk computations.

This article showed the potential of the graph of hyperlinks. We would like to explore combinations with other sources of information and algorithms, perhaps using supervised machine learning. For relatedness, we already showed improvement when combining with random walks over WordNet, but would like to explore tighter integration [Pilehvar et al.2013]. For NED, local methods [Ratinov et al.2011, Han and Sun2011], global optimization strategies based on keyphrases in context like KORE [Hoffart et al.2012], and doing NED jointly with word sense disambiguation [Moro et al.2014] are all complementary to our method and thus promising directions.


This work was partially funded by MINECO (CHIST-ERA READERS project – PCIN-2013-002-C02-01) and the European Commission (QTLEAP – FP7-ICT-2013.4.1-610516). Ander Barrena is supported by a PhD grant from the University of the Basque Country.


  • [Agirre and Edmonds2007] Eneko Agirre and Philip Edmonds. 2007. Word Sense Disambiguation: Algorithms and Applications. Springer Publishing Company, Incorporated, 1st edition.
  • [Agirre and Soroa2009] E. Agirre and A. Soroa. 2009. Personalizing PageRank for Word Sense Disambiguation. In Proceedings of 14th Conference of the European Chapter of the Association for Computational Linguistics, Athens, Greece.
  • [Agirre et al.2009] E. Agirre, A. Soroa, E. Alfonseca, K. Hall, J. Kravalova, and M. Pasca. 2009. A Study on Similarity and Relatedness Using Distributional and WordNet-based Approaches. In Proceedings of the Annual Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL), Boulder, USA, June.
  • [Agirre et al.2010] E. Agirre, M. Cuadros, G. Rigau, and A. Soroa. 2010. Exploring Knowledge Bases for Similarity. In Proceedings of the Seventh conference on International Language Resources and Evaluation (LREC’10), Valletta, Malta, May. European Language Resources Association (ELRA).
  • [Agirre et al.2014] Eneko Agirre, Oier Lopez de Lacalle, and Aitor Soroa. 2014. Random walks for knowledge-based word sense disambiguation. Computational Linguistics, 40(1):57–88.
  • [Baroni et al.2014] Marco Baroni, Georgiana Dinu, and Germán Kruszewski. 2014. Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In Proceedings of ACL.
  • [Bizer et al.2009] Christian Bizer, Jens Lehmann, Georgi Kobilarov, Sören Auer, Christian Becker, Richard Cyganiak, and Sebastian Hellmann. 2009. DBpedia - a crystallization point for the web of data. Web Semant., 7(3):154–165, September.
  • [Bollacker et al.2008] Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. 2008. Freebase: A collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, SIGMOD ’08, pages 1247–1250, New York, NY, USA. ACM.
  • [Brin and Page1998] S. Brin and L. Page. 1998. The Anatomy of a Large-scale Hypertextual Web Search Engine. In Proceedings of the seventh international conference on World Wide Web 7, WWW7, pages 107–117, Amsterdam, The Netherlands. Elsevier Science Publishers B. V.
  • [Bunescu and Pasca2006] Razvan C. Bunescu and Marius Pasca. 2006. Using encyclopedic knowledge for named entity disambiguation. In EACL. The Association for Computational Linguistics.
  • [Cornolti et al.2013] Marco Cornolti, Paolo Ferragina, and Massimiliano Ciaramita. 2013. A framework for benchmarking entity-annotation systems. In Proceedings of the 22nd International Conference on World Wide Web, WWW ’13, pages 249–260, Republic and Canton of Geneva, Switzerland. International World Wide Web Conferences Steering Committee.
  • [Cucerzan and Sil2013] Silviu Cucerzan and Avirup Sil. 2013. The MSR systems for entity linking and temporal slot filling at TAC 2013. In Proceedings of the Sixth Text Analysis Conference (TAC 2013), page 10. National Institute of Standards and Technology (NIST).
  • [Cucerzan2007] S. Cucerzan. 2007. Large-Scale Named Entity Disambiguation Based on Wikipedia Data. In Proceedings of EMNLP-CoNLL, volume June, pages 708–716.
  • [Fokkens et al.2013] Antske Fokkens, Marieke van Erp, Marten Postma, Ted Pedersen, Piek Vossen, and Nuno Freire. 2013. Offspring from reproduction problems: What replication failure teaches us. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1691–1701, Sofia, Bulgaria, August. Association for Computational Linguistics.
  • [Gabrilovich and Markovitch2007] E. Gabrilovich and S. Markovitch. 2007. Computing Semantic Relatedness using Wikipedia-based Explicit Semantic Analysis. Proc of IJCAI, pages 6–12.
  • [García et al.2014] Norberto Fernández García, Jesús Arias-Fisteus, and Luis Sánchez Fernández. 2014. Comparative evaluation of link-based approaches for candidate ranking in link-to-Wikipedia systems. J. Artif. Intell. Res. (JAIR), 49:733–773.
  • [Guo et al.2011] Yuhang Guo, Wanxiang Che, Ting Liu, and Sheng Li. 2011. A graph-based method for entity linking. In Proceedings of 5th International Joint Conference on Natural Language Processing, pages 1010–1018, Chiang Mai, Thailand, November. Asian Federation of Natural Language Processing.
  • [Hachey et al.2011] B. Hachey, W. Radford, and J.R. Curran. 2011. Graph-based Named Entity Linking with Wikipedia. In Proceedings of the 12th international conference on Web information system engineering, WISE’11, pages 213–226, Berlin, Heidelberg. Springer-Verlag.
  • [Hachey et al.2012] B. Hachey, W. Radford, J. Nothman, M. Honnibal, and J.R. Curran. 2012. Evaluating Entity Linking with Wikipedia. Artif. Intell., 194:130–150, January.
  • [Han and Sun2011] X. Han and L. Sun. 2011. A Generative Entity-mention Model for Linking Entities with Knowledge Base. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, volume 1, pages 945–954.
  • [Haveliwala2002] T.H. Haveliwala. 2002. Topic-sensitive PageRank. In Proceedings of the 11th international conference on World Wide Web (WWW’02), pages 517–526, New York, NY, USA.
  • [Hoffart et al.2011] J. Hoffart, M.A. Yosef, I. Bordino, H. Fürstenau, M. Pinkal, M. Spaniol, B. Taneva, S. Thater, and G. Weikum. 2011. Robust Disambiguation of Named Entities in Text. In Conference on Empirical Methods in Natural Language Processing, Edinburgh, Scotland, United Kingdom 2011, pages 782–792.
  • [Hoffart et al.2012] Johannes Hoffart, Stephan Seufert, Dat Ba Nguyen, Martin Theobald, and Gerhard Weikum. 2012. KORE: Keyphrase overlap relatedness for entity disambiguation. In Proceedings of the 21st ACM international conference on Information and knowledge management, pages 545–554.
  • [Hovy et al.2013] Eduard Hovy, Roberto Navigli, and Simone Paolo Ponzetto. 2013. Collaboratively built semi-structured content and artificial intelligence: The story so far. Artif. Intell., 194:2–27, January.
  • [Hughes and Ramage2007] T. Hughes and D. Ramage. 2007. Lexical Semantic Relatedness with Random Graph Walks. In Proceedings of EMNLP-CoNLL-2007, pages 581–589.
  • [Islam and Inkpen2006] A. Islam and D. Inkpen. 2006. Second order co-occurrence pmi for determining the semantic similarity of words. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2006), pages 1033–1038.
  • [Lehmann et al.2010] J. Lehmann, S. Monahan, L. Nezda, A. Jung, and Y. Shi. 2010. LCC Approaches to Knowledge Base Population at TAC 2010. In Proceedings of the Text Analysis Conference.
  • [McNamee et al.2010] P. McNamee, H.T. Dang, H. Simpson, P. Schone, and S.M. Strassel. 2010. An Evaluation of Technologies for Knowledge Base Population. In Proceedings of the 7th International Conference on Language Resources and Evaluation, pages 369–372.
  • [Mihalcea and Csomai2007] Rada Mihalcea and Andras Csomai. 2007. Wikify!: linking documents to encyclopedic knowledge. In Proceedings of the sixteenth ACM Conference on information and knowledge management, pages 233–242. ACM.
  • [Miller and Charles1991] George A. Miller and Walter G. Charles. 1991. Contextual correlates of semantic similarity. Language and Cognitive Processes, 6(1):1–28.
  • [Milne and Witten2008a] D. Milne and I.H. Witten. 2008a. An Effective, Low-Cost Measure of Semantic Relatedness Obtained from Wikipedia Links. In Proceedings of the first AAAI Workshop on Wikipedia and Artificial Intelligence.
  • [Milne and Witten2008b] D. Milne and I.H. Witten. 2008b. Learning to Link with Wikipedia. In Proceedings of the 17th ACM Conference on Information and Knowledge Management - CIKM ’08, page 509, New York, New York, USA. ACM Press.
  • [Milne and Witten2013] David Milne and Ian H. Witten. 2013. An open-source toolkit for mining Wikipedia. Artificial Intelligence, 194:222–239, January.
  • [Moro et al.2014] Andrea Moro, Alessandro Raganato, and Roberto Navigli. 2014. Entity linking meets word sense disambiguation: a unified approach. Transactions of the Association for Computational Linguistics, 2:231–244, May.
  • [Nastase and Strube2013] V. Nastase and M. Strube. 2013. Transforming Wikipedia into a large Scale Multilingual Concept Network. Artif. Intell., 194:62–85.
  • [Navigli and Ponzetto2012a] R. Navigli and S.P. Ponzetto. 2012a. BabelNet: The Automatic Construction, Evaluation and Application of a Wide-Coverage Multilingual Semantic Network. Artificial Intelligence, 193:217–250.
  • [Navigli and Ponzetto2012b] R. Navigli and S.P. Ponzetto. 2012b. BabelRelate! A Joint Multilingual Approach to Computing Semantic Relatedness. In Jörg Hoffmann and Bart Selman, editors, AAAI. AAAI Press.
  • [Noreen1989] E. W. Noreen. 1989. Computer-Intensive Methods for Testing Hypotheses. John Wiley & Sons.
  • [Pedersen et al.2004] Ted Pedersen, Siddharth Patwardhan, and Jason Michelizzi. 2004. WordNet::Similarity: Measuring the relatedness of concepts. In Demonstration Papers at HLT-NAACL 2004, HLT-NAACL–Demonstrations ’04, pages 38–41, Stroudsburg, PA, USA. Association for Computational Linguistics.
  • [Pilehvar et al.2013] Mohammad Taher Pilehvar, David Jurgens, and Roberto Navigli. 2013. Align, Disambiguate and Walk: a Unified Approach for Measuring Semantic Similarity. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, pages 1341–1351, Sofia, Bulgaria.
  • [Ponzetto and Strube2007] S.P. Ponzetto and M. Strube. 2007. Knowledge derived from Wikipedia for computing semantic relatedness. Journal of Artificial Intelligence Research, 30:181–212.
  • [Ponzetto and Strube2011] S.P. Ponzetto and M. Strube. 2011. Taxonomy Induction based on a Collaboratively built Knowledge Repository. Artificial Intelligence, 175:1737–1756.
  • [Press et al.2002] W.H. Press, S.A. Teukolsky, W.T. Vetterling, and B.P. Flannery. 2002. Numerical Recipes: The Art of Scientific Computing V 2.10 With Linux Or Single-Screen License. Cambridge University Press.
  • [Radinsky et al.2011] Kira Radinsky, Eugene Agichtein, Evgeniy Gabrilovich, and Shaul Markovitch. 2011. A word at a time: computing word relatedness using temporal semantic analysis. In Proceedings of the 20th international conference on World wide web, WWW ’11, pages 337–346, New York, NY, USA. ACM.
  • [Ratinov et al.2011] L.A. Ratinov, D. Roth, D. Downey, and M. Anderson. 2011. Local and Global Algorithms for Disambiguation to Wikipedia. In The 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Conference, 19-24 June, 2011, Portland, Oregon, USA, pages 1375–1384. The Association for Computational Linguistics.
  • [Rubenstein and Goodenough1965] H. Rubenstein and J.B. Goodenough. 1965. Contextual Correlates of Synonymy. Communications of the ACM, 8(10):627–633.
  • [Strube and Ponzetto2006] Michael Strube and Simone Paolo Ponzetto. 2006. WikiRelate! Computing semantic relatedness using Wikipedia. In Proceedings of the National Conference on Artificial Intelligence, volume 21, pages 1419–1424. Menlo Park, CA; Cambridge, MA; London; AAAI Press.
  • [Tsatsaronis et al.2010] G. Tsatsaronis, I. Varlamis, and M. Vazirgiannis. 2010. Text Relatedness Based on a Word Thesaurus. J. Artif. Intell. Res. (JAIR), 37:1–39.
  • [Varma et al.2009] V. Varma, V. Bharat, S. Kovelamudi, P. Bysani, S. GSK, K. Kumar N, K. Reddy, K. Kumar, and N. Maganti. 2009. IIIT Hyderabad at TAC 2009. Technical report.
  • [Yazdani and Popescu-Belis2013] Majid Yazdani and Andrei Popescu-Belis. 2013. Computing text semantic relatedness using the contents and links of a hypertext encyclopedia. Artificial Intelligence, 194:176–202, January.
  • [Yeh et al.2009] E. Yeh, D. Ramage, C.D. Manning, E. Agirre, and A. Soroa. 2009. WikiWalk: Random walks on Wikipedia for Semantic Relatedness. In Proceedings of the 2009 Workshop on Graph-based Methods for Natural Language Processing (TextGraphs-4), pages 41–49, Suntec, Singapore, August. Association for Computational Linguistics.