We tackle Named Entity Disambiguation (NED) by comparing entities in short sentences with Wikidata graphs. Creating a context vector from graphs through deep learning is a challenging problem that has never been applied to NED. Our main contribution is to present an experimental study of recent neural techniques, as well as a discussion about which graph features are most important for the disambiguation task. In addition, a new dataset (Wikidata-Disamb) is created to allow a clean and scalable evaluation of NED with Wikidata entries, and to be used as a reference in future research. In the end our results show that a Bi-LSTM encoding of the graph triplets performs best, improving upon the baseline models and scoring an F1 value of 91.6% on the test set.
A mentioned entity in a text may refer to multiple entities in a knowledge base. The process of correctly linking a mention to the relevant entity is called Entity Linking (EL) or NED (Bunescu and Pasca, 2006). Entity disambiguation is different from Named Entity Recognition (NER), where the system must detect the relevant mention boundaries given a definite set of entity types. In NED, the system must be able to generate a context for an entity in a text and an entity in a knowledge base, then correctly link the two. NED is a crucial step in web search tasks (Blanco et al., 2015; Artiles et al., 2009; Cucerzan, 2007), data mining (Dorssers et al., 2017; Chang et al., 2016; Hoffart et al., 2011), and semantic search (Meij et al., 2014; Dietz et al., 2017). Arguably, disambiguation falls into the category of tasks where humans still vastly outperform algorithmic solutions. This is what makes research into NED so relevant in today’s data mining landscape.
A background knowledge base can appear in many forms: as a collection of texts, as a relational database, or as a collection of graphs in a graph database. Representing data as ensembles of linked information is an increasingly popular form of storage. One example can be found in the successful Wikidata database (Vrandečić, 2012), which aims to mirror the content of Wikipedia in a linked format.
Both Wikipedia and Wikidata contain potentially ambiguous entities. For example, when searching for information about Captain Marvel, the results should depend on the context in which this entity appears. Indeed, the name Captain Marvel can refer both to a character from Marvel comics and to Michael Jordan, the basketball player, for whom it is a nickname.
With Wikidata the ambiguity can be resolved by looking at the information linked to both entities, as we show in Fig. 1. More specifically, the issue of disambiguating entities using information in a graph format has been addressed sparsely in the literature. We wish to contribute on this topic with the current paper.
The main contributions of this work are two-fold: First, we aim to empirically evaluate different deep learning techniques to create a context vector from graphs, aimed at high-accuracy NED. This is the most novel aspect of our work, as there is currently no study on a neural approach for entity disambiguation using graphs as background knowledge. Current state-of-the-art algorithms (Raiman and Raiman, 2018) are able to build a context from Wikipedia pages, but many academic and commercial projects use a graph-like knowledge base. An excellent study on NED with graphs was carried out in (Usbeck et al., 2014), when the techniques for using neural networks on graphs were still under-developed. Deep learning has since been on a fast-growing trajectory, often providing the best performance. Hopefully, our work can provide directions on which neural tools are most appropriate and which graph features are most important for the task at hand. Specifically, we explore whether representing graphs as triplets is more useful than using the full topological information of the graph, and which features can be ignored while still achieving an acceptable disambiguation rate.
Secondly, we create a new dataset to help us in our endeavor. Among the datasets available for this task we took inspiration from Wiki-Disamb30 (Ferragina and Scaiella, 2010). In that work, Ferragina and Scaiella tackle the problem of cross referencing text fragments with Wikipedia pages. Specifically, they deal with very short sentences (30-40 words). We build on their work by translating the pointers of Wikipedia pages to Wikidata items, thereby creating an ad hoc dataset based on Wiki-Disamb30. We call our derivative dataset Wikidata-Disamb. This new dataset provides a suitable playground for us to test various models of NED on Wikidata.
The paper is organized as follows: Sec. 2 provides a concise description of our task, while in Sec. 3 we describe all our models. In Sec. 4 we explain how the dataset has been created and in Sec. 5 we detail the training used. In Sec. 6 we summarize the results and discuss the relevance of our models for classifying Wikidata entries. Finally, a review of similar results is presented in Sec. 7, and Sec. 8 concludes and suggests further research directions.
All models share three main elements: A graph, a text, and an entity in the text to disambiguate. The disambiguation task is reduced to a consistency test between the input text and the graph.
The graph is composed of nodes connected by edges. Each node vector is the centroid of the Glove word vectors of the words that make up the node label: For example, a node called "New York" is represented by averaging the word vectors of "New" and "York". Each edge connects one node with another. The set of edge vectors is computed exactly as for the node vectors, by averaging over the word vectors in each of the edge's labels: For example, the vector of "instance of" is the average of the Glove vectors of "instance" and "of". The values of the adjacency matrix of a graph are set to 1 in the elements that are connected by an edge and to 0 otherwise.
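The construction of the node vectors, edge vectors, and adjacency matrix can be sketched as follows. This is a minimal illustration, not the implementation used in the paper: the 3-dimensional vectors in `TOY_VECTORS` are made-up stand-ins for real Glove embeddings, and the helper names `centroid` and `adjacency_matrix` are hypothetical.

```python
# Toy 3-dimensional "word vectors" (assumed values); real Glove vectors
# would be loaded from a pre-trained file instead.
TOY_VECTORS = {
    "new": [1.0, 0.0, 2.0],
    "york": [3.0, 2.0, 0.0],
    "instance": [0.0, 1.0, 1.0],
    "of": [2.0, 1.0, 3.0],
}


def centroid(label, vectors=TOY_VECTORS):
    """Average the word vectors of the words in a node or edge label."""
    words = label.lower().split()
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    for word in words:
        for k, value in enumerate(vectors[word]):
            total[k] += value
    return [value / len(words) for value in total]


def adjacency_matrix(num_nodes, edges):
    """Return a 0/1 matrix with a 1 wherever an edge connects node i to node j."""
    matrix = [[0] * num_nodes for _ in range(num_nodes)]
    for i, j in edges:
        matrix[i][j] = 1
    return matrix


node_vector = centroid("New York")
edge_vector = centroid("instance of")
adjacency = adjacency_matrix(3, [(0, 1), (0, 2)])
```

Words missing from the vocabulary are not handled here; a real pipeline would need a policy for them.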
The text is described as a sequence of word vectors, represented using the Glove embeddings, while the item is used to query the Wikidata dataset for the corresponding entry. In most models we have one embedding for the input text and one for the graph.
All our models receive as input the node vectors and the word embeddings (and possibly the edge vectors). The output of our models is a binary vector, which tells us whether the input graph is consistent with the entity in the text.
The size of the training dataset allows us to experiment with relatively complex models. In the end we train nine different models, five of which are baselines. The configuration of each model is summed up in Appendix A.
The Wikidata graphs need to be processed by a neural network. To do so we can either represent the graph as a list of triplets, thereby effectively losing the topology of the network, or employ a method that encodes the topology in the final embeddings. In the following we address both representations by employing a Recurrent Neural Network (RNN) and a Graph Convolutional Network (GCN) (Kipf and Welling, 2016) respectively.
In all these models the input text is treated in the same way: The text word vectors are first fed to a Bi-LSTM, with outputs $y_i$. These outputs are then weighted by a mask: A set of scalars $m_i$ which are 1 where the item is supposed to be and 0 otherwise. For example the sentence "The comic book hero Captain Marvel is …" would have $m = [0, 0, 0, 0, 1, 1, 0, \dots]$. This mask acts as a "manually induced" attention on the item to disambiguate. The final output of the Long Short-Term Memory (LSTM) mechanism is the average

$$ y_{\text{text}} = \frac{1}{N} \sum_{i=0}^{N-1} m_i \, y_i $$

given a sentence with length $N$.
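The masking step can be illustrated with a short sketch. It assumes that the masked outputs are summed and divided by the sentence length, and it uses made-up 2-dimensional vectors in place of real Bi-LSTM outputs; the function names `entity_mask` and `masked_average` are hypothetical.

```python
def entity_mask(tokens, entity_tokens):
    """Return 1 at the positions of the entity mention and 0 elsewhere."""
    mask = [0] * len(tokens)
    span = len(entity_tokens)
    for start in range(len(tokens) - span + 1):
        if tokens[start:start + span] == entity_tokens:
            for k in range(start, start + span):
                mask[k] = 1
    return mask


def masked_average(outputs, mask):
    """Weight each output vector by its mask value, then divide by the sentence length."""
    length = len(outputs)
    dim = len(outputs[0])
    return [
        sum(mask[i] * outputs[i][k] for i in range(length)) / length
        for k in range(dim)
    ]


tokens = "The comic book hero Captain Marvel is".split()
mask = entity_mask(tokens, ["Captain", "Marvel"])

# Stand-in for the Bi-LSTM outputs: one 2-dimensional vector per token.
outputs = [[float(i), 1.0] for i in range(len(tokens))]
text_embedding = masked_average(outputs, mask)
```

Only the positions of "Captain" and "Marvel" contribute to `text_embedding`; every other token is zeroed out by the mask.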
The following items are our graph-based models:
In the RNN of triplets model, represented in Fig. 2, a Bi-LSTM (Augenstein et al., 2016) is applied over the sequence of triplets in the graph. In the list of input vectors each item is the concatenation of three elements:

$$ x^{\text{triplet}}_{ij} = x_i \oplus e_{ij} \oplus x_j $$

where $x_i$ and $x_j$ are node vectors, $e_{ij}$ is the vector of the edge that connects them, and $(i, j)$ are all the indices between connected nodes in a directed graph. The final states of the Bi-LSTM are concatenated and then fed to a dense layer, whose output is the graph embedding.
While this model captures the information of single hops in the graph, it is not suited for capturing the topology of the network. For example, nodes that are topologically close might appear far away in the set of triplets. More importantly, the final embeddings might depend on the specific ordering of the triplets, losing the information about the network shape.
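A minimal sketch of how a graph can be flattened into a sequence of triplet vectors is given below. It assumes the concatenation order (source node, edge, target node) and uses toy 2-dimensional vectors; `triplet_vectors` is a hypothetical helper, not code from the paper.

```python
def triplet_vectors(node_vectors, edge_vectors):
    """Concatenate (source node, edge, target node) vectors for every directed edge.

    node_vectors: dict mapping a node index to its vector
    edge_vectors: dict mapping a (source, target) index pair to the edge vector
    """
    sequence = []
    for (i, j), edge in edge_vectors.items():
        sequence.append(node_vectors[i] + edge + node_vectors[j])
    return sequence


# Toy values: three nodes and two labelled edges leaving node 0.
nodes = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
edges = {(0, 1): [0.5, 0.5], (0, 2): [0.2, 0.8]}
sequence = triplet_vectors(nodes, edges)
```

The order of `sequence` follows the insertion order of `edges`, which makes the ordering dependence discussed above concrete: a recurrent encoder fed this list sees the triplets in whatever order they were stored.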
We improve upon the prior model by adding an attention mechanism (Bahdanau et al., 2014; Vaswani et al., 2017) after the LSTM for triplets (Fig. 3). The output vectors of the LSTM are weighted by a scalar attention coefficient and then summed together to create the context vector for the graph. The attention coefficients are computed from two weight matrices and a vector, all of which are learned in training.
We expect this attention method to improve the disambiguation task by giving more weight to relevant triplets.
GCNs have been able to provide state-of-the-art results for Entity Prediction (Schlichtkrull et al., 2017), Semantic Role Labelling (Marcheggiani and Titov, 2017), matrix completion for recommender systems (Berg et al., 2017), and relational inference (Kipf et al., 2018). It seems therefore natural to use graph convolutions for our NED task. Specifically, the convolutions can be employed to create an embedding vector of the relevant Wikidata graph.
A graph convolutional network works by stacking convolutional layers based on the topology of the network. Typically, by stacking together $L$ layers the network can propagate the features of nodes that are at most $L$ hops away. The information at the $l$-th layer is propagated to the next one according to the equation

$$ h^{l+1}_v = \rho\Big( \sum_{u \in \mathcal{N}(v)} \big( W^l h^l_u + b^l \big) \Big) $$

where $u$ and $v$ are two indices of nodes in the graph, $\mathcal{N}(v)$ is the set of nearest neighbors of node $v$, plus the node itself, and $\rho$ is a nonlinear activation function. The vector $h^l_u$ represents node $u$'s embeddings at the $l$-th layer. The matrix $W^l$ and vector $b^l$ are learned during training and map the embeddings of node $u$ onto the adjacent nodes in the graph. In this paper we only consider the outgoing edges from each node.
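One propagation step of a graph convolution can be sketched in a few lines. The sketch makes several assumptions that are not stated in the text: ReLU is used as the nonlinearity, the bias is added once per neighbor message, and a node's neighborhood is made of itself plus the targets of its outgoing edges. The weights are fixed toy values rather than trained parameters.

```python
def relu(vector):
    """Element-wise rectified linear unit."""
    return [max(0.0, value) for value in vector]


def affine(weight, vector, bias):
    """Compute weight @ vector + bias for a weight matrix given as a list of rows."""
    return [
        sum(w * x for w, x in zip(row, vector)) + b
        for row, b in zip(weight, bias)
    ]


def gcn_layer(h, adjacency, weight, bias):
    """Each node sums the transformed embeddings of itself and of the nodes it points to."""
    num_nodes = len(h)
    out = []
    for v in range(num_nodes):
        neighbours = [u for u in range(num_nodes) if adjacency[v][u] == 1 or u == v]
        total = [0.0] * len(bias)
        for u in neighbours:
            message = affine(weight, h[u], bias)
            total = [t + m for t, m in zip(total, message)]
        out.append(relu(total))
    return out


# Node 0 is the central item with outgoing edges to nodes 1 and 2.
adjacency = [[0, 1, 1], [0, 0, 0], [0, 0, 0]]
h0 = [[1.0, 0.0], [0.0, 2.0], [3.0, -1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
h1 = gcn_layer(h0, adjacency, identity, [0.0, 0.0])
```

With identity weights, the central node ends up holding the sum of its own vector and those of its two neighbors, which is the sense in which information flows onto the central item.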
With the topology of the Wikidata graph, the information of each node is propagated onto the central item. Ideally, after the graph convolutions, the vector at the position of the central item summarizes the information in the graph.
One of the challenges of the original formulation of GCN is including the information contained in the edges' labels (which are not present in Eq. 5). One way to solve this issue is to see the convolutions as a form of message passing (Gilmer et al., 2017; Kipf et al., 2018).
We do not explicitly use the message passing technique. Instead we opt for a similar solution: we reify the relations to appear as additional nodes (see Fig. 5). In this way the edges become nodes themselves. The end result is comparable to the message passing model, where information flows from a vertex to an edge and is eventually dispatched to another node.
The original formulation of GCN can therefore be applied to this modified graph.
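The reification of relations can be sketched as a simple graph transformation, in which every labelled edge is replaced by a new node and two unlabelled edges. The function name `reify_relations` and the toy labels are hypothetical and only serve as an illustration.

```python
def reify_relations(node_labels, labelled_edges):
    """Replace each labelled edge (source, relation, target) with a relation node.

    Returns the extended list of node labels and the new unlabelled edge list,
    where source -> relation node -> target.
    """
    labels = list(node_labels)
    edges = []
    for source, relation, target in labelled_edges:
        relation_node = len(labels)
        labels.append(relation)
        edges.append((source, relation_node))
        edges.append((relation_node, target))
    return labels, edges


# Toy graph with a single labelled edge between two nodes.
labels, edges = reify_relations(
    ["Captain Marvel", "fictional character"],
    [(0, "instance of", 1)],
)
```

After this step the relation label has a node vector of its own, so a convolution that only knows about nodes and an adjacency matrix can still use it.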
GCNs are designed to capture the topology of the graph. The final vector contains information that comes from the node vectors, the edge vectors, and the adjacency matrix. These components end up building the graph embedding.
The attention coefficients weight the information being passed from one node to another. This attention needs to be a function of the node and edge vectors, as well as a function of the input text. In our method for GCN attention, the coefficients are obtained from a function that acts on the last dimension of its argument, combined with the original adjacency matrix of the graph. A first matrix models the information propagating from a node in the context of the input text: it is built from the matrix whose columns are the layer vectors of the nodes and from a matrix whose columns are all identical and equal to the input text embedding, using two weight matrices and a bias matrix with an arbitrary intermediate dimension.
If the first matrix models the outgoing information from each node, a second matrix then models the information arriving at the nodes.
In the end, the attention coefficients are the set of weights for messages being propagated among all the nodes.
A final element-wise multiplication with the adjacency matrix masks out the elements that are not connected.
To the best of our knowledge, this formulation of GCN attention is original.
The evaluation of previous models would not be meaningful without a set of baselines. To this end, we want to know how much of the graph information is useful to achieve the best accuracy.
Vector distance baseline: This is the simplest method, based on the hypothesis that the input text might be semantically closer to the correct graph than to the wrong one. We therefore take as inputs the simple average of the text vectors, $\bar{w}$, and the average of all the nodes' vectors in the graph, $\bar{x}$. The classification task is then performed by finding a threshold distance $\delta$ according to which

$$ d(\bar{w}, \bar{x}) < \delta $$

for the correct graph, and

$$ d(\bar{w}, \bar{x}) \geq \delta $$

for the incorrect graph, where $d$ is the distance between the two averaged vectors. We chose $\delta$ as the distance that maximizes the score in the training dataset.
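The threshold selection of the vector distance baseline can be sketched as a scan over candidate values. The sketch assumes that the distances between averaged text and graph vectors are already computed and uses plain accuracy as the training score, since the exact distance and score used in the paper are not recoverable from the text. The names `accuracy_at` and `best_threshold` are hypothetical.

```python
def accuracy_at(threshold, distances, labels):
    """Fraction of pairs classified correctly when 'distance < threshold' means consistent."""
    hits = sum((d < threshold) == label for d, label in zip(distances, labels))
    return hits / len(labels)


def best_threshold(distances, labels):
    """Pick the candidate threshold with the highest score on the training pairs."""
    candidates = sorted(set(distances))
    # Also try a value above every distance, so "everything is consistent" is covered.
    candidates.append(candidates[-1] + 1.0)
    return max(candidates, key=lambda t: accuracy_at(t, distances, labels))


# Toy training pairs: small distances for correct graphs, large ones for wrong graphs.
distances = [0.2, 0.4, 0.9, 1.3]
labels = [True, True, False, False]
threshold = best_threshold(distances, labels)
```

On real data the two classes overlap heavily, which is consistent with this baseline performing only narrowly better than chance.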
Feedforward of averages: We take the average of the words in the sentence and the average of all the nodes in the graph, concatenate them, and feed them to a feedforward neural net with one hidden layer. The final output is binary, meaning that the sentence can be either consistent or inconsistent with the Wikidata graph.
Text LSTM + Linear attention: Instead of using the average for representing the graph, we employ an attention model over the node vectors. The output of this attention model is a weighted sum of the node vectors, whose attention parameters are learned through backpropagation. This attention technique ideally improves the classification task by giving more weight to relevant nodes.
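Attention pooling over node vectors can be illustrated with a generic sketch. The scoring function below (a dot product with a weight vector plus a scalar bias, normalized with a softmax) is an assumption chosen for simplicity and is not necessarily the parameterization used in the paper; the parameters are fixed toy values rather than learned ones.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scalars."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(node_vectors, weight, bias):
    """Score every node, normalize the scores, and return the weighted sum of the nodes."""
    scores = [
        sum(w * x for w, x in zip(weight, node)) + bias
        for node in node_vectors
    ]
    alphas = softmax(scores)
    dim = len(node_vectors[0])
    pooled = [
        sum(a * node[k] for a, node in zip(alphas, node_vectors))
        for k in range(dim)
    ]
    return alphas, pooled


nodes = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
alphas, pooled = attend(nodes, weight=[0.0, 0.0], bias=0.0)
```

With all-zero parameters every node receives the same weight and the output reduces to the plain centroid, so the average-based representation is a special case of this pooling.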
We create a new dataset from the information in the Wiki-Disamb30 set. Originally, the dataset addressed the need for a corpus of very short texts (30-40 words) where a specific entity was linked to the correct Wikipedia page. The original dataset contains about 2 million entries and presents three elements for each one: an English sentence, the name of the entity to disambiguate, and the correct Wikipedia item corresponding to the entity.
One example is presented in Table 1.

Table 1
|Text|Entity|Wikipedia ID|
|fantasy novelist David Gemmell. Achilles is featured heavily in the novel The Firebrand by Marion Zimmer Bradley. The comic book hero Captain Marvel is endowed with the courage of Achilles, as well|Captain Marvel|403585|

Table 2
|Text|Entity|Correct Wikidata ID|Wrong Wikidata ID|
|fantasy novelist David Gemmell. Achilles is featured heavily in the novel The Firebrand by Marion Zimmer Bradley. The comic book hero Captain Marvel is endowed with the courage of Achilles, as well|Captain Marvel|Q534153|Q41421|
The ambiguous item here is the name Captain Marvel, which is a character from Marvel comics and – for example – a nickname for Michael Jordan, the basketball player. The correct interpretation in the example sentence is the former.
Our dataset provides a conversion from the Wikipedia page to a Wikidata item, when this conversion exists. If the conversion is not possible the original entry is simply skipped. In order to have a consistent disambiguation task we also select an incorrect Wikidata item to pair with the correct one, linked to it by having the same name (or same alias).
One example item is shown in Table 2: The correct item is Q534153, the Wikidata entry of Captain Marvel. The incorrect entry is Q41421, which is the entry for Michael Jordan, also known as Captain Marvel. The incorrect entry is selected so that it is not trivial, i.e., it is neither a disambiguation page nor an entry with no triplets. In this way we obtain a balanced dataset, where the correct entity appears as many times as the wrong one.
After applying those selection constraints, and keeping in mind scalability issues, we chose 120,000 items: 100,000 in the training set and 10,000 each in the development and test sets. This information is then fed to our models.
Each model's performance is measured by how well it can predict whether a Wikidata graph is consistent with the entity in the input sentence. Since we measure the consistency with correct and wrong Wikidata IDs separately, the size of the training set effectively doubles, with 200 thousand graphs to compare with their respective sentences. Likewise, in the development and test sets we compare the consistency predictions of 20 thousand graphs with the relevant texts.
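The doubling of the dataset can be made concrete with a short sketch that expands each entry into one consistent and one inconsistent example. The field names `text`, `entity`, `correct_id`, and `wrong_id` are hypothetical and do not necessarily match the released files.

```python
def to_examples(entries):
    """Expand each entry into a positive and a negative (text, entity, item, label) example."""
    examples = []
    for entry in entries:
        examples.append((entry["text"], entry["entity"], entry["correct_id"], 1))
        examples.append((entry["text"], entry["entity"], entry["wrong_id"], 0))
    return examples


entries = [{
    "text": "The comic book hero Captain Marvel is endowed with the courage of Achilles",
    "entity": "Captain Marvel",
    "correct_id": "Q534153",
    "wrong_id": "Q41421",
}]
examples = to_examples(entries)
```

Because every entry contributes exactly one example per label, the resulting classification problem is balanced by construction.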
We use TensorFlow to implement our neural network. The weights are initialized randomly from the uniform distribution and the initial states of the LSTMs are set to zero. We use binary cross entropy as the final loss function. An Adam optimizer (Kingma and Ba, 2014) is used. Whenever applicable, we employ a batch size of 10.
The dimension of the Glove vectors used in our experiments is 300, and in all our tests we cut the Wikidata graph after 2 hops from the central node. This hopping distance has been selected to maximize the performance of the GCN based models.
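Cutting a graph at a fixed hopping distance can be sketched with a breadth-first search from the central node. The sketch assumes that hops are counted along outgoing edges of the original (non-reified) graph; the function name `cut_graph` and the toy node names are hypothetical.

```python
from collections import deque


def cut_graph(edges, central, max_hops=2):
    """Keep only the directed edges reachable within max_hops of the central node."""
    outgoing = {}
    for source, target in edges:
        outgoing.setdefault(source, []).append(target)

    distance = {central: 0}
    queue = deque([central])
    kept = []
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # nodes at the maximum distance are not expanded further
        for target in outgoing.get(node, []):
            kept.append((node, target))
            if target not in distance:
                distance[target] = distance[node] + 1
                queue.append(target)
    return kept


edges = [("item", "a"), ("a", "b"), ("b", "c"), ("item", "d")]
kept = cut_graph(edges, "item")
```

The edge from `b` to `c` is dropped because `b` already sits two hops away from the central item.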
|Vector distance baseline|
|Feedforward of averages|
|Text LSTM+Linear attention|
|Text LSTM+RNN of nodes|
|Text LSTM+RNN of triplets|
|Text LSTM+RNN of triplets with attention|
|Text LSTM+GCN with attention|
The results of our experiments are presented in Table 3. We ran two evaluations for each model and show the average result. The difference between the lowest and highest score varies across the different models. In the absence of more complete statistics, we choose the middle value of this range as an estimate for the statistical error to attribute to all our measurements. All results are rounded to the first significant digit of the error.
The simple vector distance baseline performs only narrowly better than random chance on the test set. This is not unexpected since the model only captures the distance between two centroids.
The feedforward of averages works much better on the test set. Given how little information is fed as an input, the result has more to do with the quality of the Glove vectors than with our model.
The other models are more complex, and this additional complexity seems to have non-trivial consequences for the results. For example, the text LSTM + Centroid model achieves the third highest score in this paper on the test set. This is an increase over the simple feedforward model, meaning that the text LSTM part (the only change from the prior model) is extremely relevant in processing the input text.
The following two models, linear attention and RNN of nodes, do not provide any significant improvement upon the feedforward of averages. The modest results of the linear attention model are particularly interesting, suggesting that the classification task does not rely on specific easy-to-identify nodes, and that the information in the whole node set seems to play a role in an accurate result.
The second best result of the paper is given by the RNN of triplets model on the test set. This model uses the whole graph information, taking as input the set of triplets that compose the Wikidata graph. The RNN of triplets with attention seems to perform even better, reaching an F1 score of 91.6%. A straightforward conclusion is that the classification task is mildly helped by paying attention to specific triplets. In retrospect this is unsurprising, as relations like instance_of and subclass_of give a relevant hint to the type of entity described by the graph. The improvement is however small, and it seems to indicate that, on average, there is no "critical triplet" for NED.
Conversely, the Text LSTM+GCN model performs poorly. The reason for this drop in performance is complex, and we believe it rests in the way GCNs create the final embedding vector. The GCN embeddings sum up information that comes from the graph nodes, edges, and topology of the network. This last piece of information is not considered in the triplet model, and we believe it is what confuses the graph convolutional model: For example, looking at Fig. 1 the GCN might decide that the main difference between the two graphs is in the network topology, not in the content of the nodes; conversely graphs with similar topology can be considered closer to each other than graphs with similar triplets. In short, in our experiments the GCN seems to give too much importance to the shape of the graphs in training, ending up being confused when testing. It seems that the graph convolutional model would perform better on a dataset where the topology of the graphs plays a more relevant role. In our datasets the graphs are simple trees, and the key pieces of information seem to come from relation triplets.
The Text LSTM+GCN with attention model seems to perform somewhat better. The attention model effectively adapts the topology of the network to the input text, alleviating some of the issues with the prior model. Even so, the attentive GCN does not perform well. We would be excited to apply a GCN-based model to a dataset with a richer network topology in future work.
Disambiguating among similar entities is a crucial feature in extracting relevant information from texts: Ambiguous information needs to be resolved, requiring additional steps that go beyond grammatical parsing. Correctly sieving information from huge corpora of text is especially suited to a deep learning approach, given the data-hungry nature of neural networks.
EL is however not a recent endeavor. The work of Bunescu and Pasca (Bunescu and Pasca, 2006) introduced the idea of disambiguating text through the use of knowledge bases. An entity in a sentence is compared to entries in a corpus and the correct meaning is resolved by using an appropriate similarity function.
Shortly thereafter, the authors of (Milne and Witten, 2008) introduced the idea of learning to link entities on Wikipedia. In this spirit, the Wiki-Disamb30 dataset (Ferragina and Scaiella, 2010) was created and further used in the TAC2011 Knowledge Base Population Track (McNamee and Dang, 2009).
The previous works use a corpus of unstructured texts (Wikipedia) to disambiguate entities in sentences. However, there is a long tradition of using structured data for entity classification and linking. One example is (Bhagavatula et al., 2015), where the authors address the problem of entity linking within web tables. Their problem is similar to the one we tackle in this paper, although restricted to the types of tables one can find in Wikipedia pages or more general html pages. The Wikipedia social network is exploited in (Geiß and Gertz, 2016), where the authors study the linking of person entities and the disambiguation of homonyms. A work closely related to ours is (Usbeck et al., 2014), in which the authors employ a search-based algorithm for the disambiguation task. We refrained however from using the datasets in that paper because the amount of annotated data used therein seemed to be insufficient for training a neural net.
In (Schuhmacher and Ponzetto, 2014) the authors present a novel way to build a semantic graph to represent a document's content. Those graphs are then used to rank related entities, as well as to provide a document similarity score. The ranking algorithm is built around the idea that similar entities are close to each other in the semantic graph (hopping distance). A similar work is presented in (Ren et al., 2017). The authors created a system (CoType) where relations and entities are extracted together by means of comparing the entities in a sentence to the items in a knowledge graph. Furthermore, the task of using semantic graphs to mine topics is explored in (Chen et al., 2017).
The role of deep learning in EL is also studied in some recent works. The authors Globerson et al. invent a novel attention mechanism for entity resolution. In (Raiman and Raiman, 2018) a system of entities that are easy to learn is created, and eventually they are able to improve upon the state-of-the-art in the Wiki-Disamb30 dataset.
The work of Meij et al. reviews in detail the role of EL in a semantic search context, arguing that NED and Entity Retrieval enable modern search engines to organize their wealth of information around entities.
Most EL datasets are based on relatively well-behaved text. The challenge of NED on noisy text is addressed in (Eshel et al., 2017), where they present a new dataset with more realistic html page fragments. Another problem is disambiguating entities in a question answering system. This issue is studied in (Klang and Nugues, 2014).
The dataset we present in this paper is derived from the Wiki-Disamb30 corpus. A comparison with prior results evaluated on the original dataset seems due, albeit somewhat contrived: In the original dataset the context for disambiguation comes from Wikipedia pages, whereas in our work we build an embedding vector from Wikidata graphs.
In (Raiman and Raiman, 2018) the authors summarize recent NED results on the Wiki-Disamb30 dataset, obtained by running the algorithms of the original papers (Milne and Witten, 2008) and (Ferragina and Scaiella, 2010). The state of the art still lies in the work of Raiman and Raiman.
We have shown that it is possible to disambiguate entities in short sentences by looking at the corresponding entries in Wikidata. In order to achieve this result, we created a new dataset Wikidata-Disamb, where we present an equal number of correct and incorrect entity linking candidates.
Our RNN of triplets with attention model allows us to achieve the best result on the test set. This is an improvement over the baseline of the simple feedforward of averages model, where the edges of the Wikidata graph are not used.
The main contribution of this improvement seems to come from processing the input text with a Bi-LSTM. Various methods of dealing with the Wikidata graph do not seem to correlate with a big improvement in the results. Indeed, most baseline models - that only consider the nodes of the graph - seem to perform roughly equally. The second biggest improvement happens when including information about the relation type with the RNN of triplets.
The GCN based approaches are seen to perform poorly. In our dataset the topology of the graphs seems to play a secondary role (most of the graphs in the dataset are simple trees), and the performance of the graph convolutional models drops as a result. More interesting graph topologies should make the GCN perform better. We aim to address this issue with different datasets in following works.
In the future, we aim to use similar techniques to pursue disambiguation tasks outside the dataset we created. Moreover, a similar approach can be used to create graph embeddings of Wikidata items, which can be used for improving semantic search tasks (Meij et al., 2014). In addition, cross-language entity linking could be addressed with techniques similar to ours, by using datasets in different languages (Pappu et al., 2017). Our aim is to address these challenges in following works.
This work was partially supported by InnovateUK grant Ref. 103677.
TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. Software available from tensorflow.org. http://tensorflow.org/
|Final hidden layer size||dim|
|Word vectors size||dim|
|Final hidden layer size||dim|
|Dense layer before text LSTM||dim|
|Text LSTM memory||() dim|
|Word vectors size||dim|
|Node vectors size||dim|
|Final hidden layer size||dim|
|Dense layer before text LSTM||dim|
|Text LSTM memory||() dim|
|Word vectors size||dim|
|Triplet vectors size||dim|
|Final hidden layer size||dim|
|Dense layer before text LSTM||dim|
|Text LSTM memory||() dim|
|Dense layer before graph LSTM||dim|
|Graph LSTM memory||() dim|