Knowledge graphs have proven useful for many applications, including document retrieval (Dalton et al., 2014) and question answering (Mohammed et al., 2018). As there already exist many large-scale efforts such as Freebase (Bollacker et al., 2008), DBpedia (Auer et al., 2007), Wikidata (Vrandečić and Krötzsch, 2014), and YAGO (Suchanek et al., 2007), to support interoperability there is a need to match entities across multiple resources that refer to the same real-world entity. Addressing this challenge would, for example, allow mentions in free text that have been linked to entities in one knowledge graph to benefit from knowledge encoded elsewhere.
Ambiguity, of course, is the biggest challenge in this problem: for example, there are multiple persons named “Adam Smith” in both DBpedia and Wikidata. The obvious solution is to exploit the context of entities for matching. In this paper, we present two datasets for entity matching between DBpedia and Wikidata that specifically focus on ambiguous cases. Interestingly, experimental results show that with a classification-based formulation, an off-the-shelf graph embedding, RDF2Vec (Ristoski and Paulheim, 2016), combined with a simple multi-layer perceptron (MLP) achieves high accuracy on these datasets.
We view this short paper as having the following two contributions: First, we offer the community two large-scale datasets for entity matching, focused on ambiguous entities. Second, we show that a simple model performs well on these datasets. Results suggest that RDF2Vec
can capture the context of entities in a low dimensional semantic space, and that it is possible to learn associations between distinct semantic spaces (one from each knowledge graph) using a simple MLP to perform entity matching with high accuracy. Experiments show that only small amounts of training data are required, but a linear model (logistic regression) on the same graph embeddings performs poorly. Naturally, we are not the first to have worked on aligning knowledge graphs (see discussion in Section 5). While more explorations are certainly needed, to our knowledge we are the first to make these interesting observations.
2 Problem Formulation
We begin by formalizing our entity matching problem. Given a source knowledge graph whose entities are identified by Uniform Resource Identifiers (URIs), for each entity we wish to find the entity in the target knowledge graph that corresponds to the same real-world entity, in the common-sense notion that these entities refer to the same person, location, etc. In our current formulation, we take a query-based approach: that is, for a given “query” entity in the source knowledge graph, our task is to determine the best matching entity in the target knowledge graph. Although, in principle, a query entity from the source knowledge graph may be correctly mapped to more than one entity in the target knowledge graph, such instances are rare and we currently ignore this possibility.
To study the entity matching problem, we began by creating two benchmark datasets exploiting owl:sameAs predicates that link entities between DBpedia (version 2016-10, from https://wiki.dbpedia.org/downloads-2016-10) and Wikidata (dump 2018-10-29, from https://dumps.wikimedia.org/wikidatawiki/entities/20181029/). These predicates are manually curated and can be viewed as high-quality ground truth. The total number of mappings obtained by querying DBpedia and Wikidata using SPARQL was 6,974,651. We then removed all mappings referring to Wikipedia disambiguation pages.
Although entities with different names in two knowledge graphs may refer to the same real-world entity, we focus on the ambiguity problem and hence restrict our consideration to entities in knowledge graphs that share the same name—more precisely, the foaf:name predicate in DBpedia and the rdfs:label predicate in Wikidata. Furthermore, to make our task more challenging, we only consider ambiguous cases (since string matching is sufficient otherwise). To accomplish this, we first built two inverted indexes of the names of all entities in DBpedia and Wikidata to facilitate rapid querying. Our problem formulation leads to the construction of two datasets, corresponding to entity matching in each direction:
DBpedia to Wikidata: Here, we take DBpedia as the source knowledge graph and Wikidata as the target. For each entity in DBpedia, we queried the above index to retrieve entities with the same name in Wikidata, which forms a candidate set for disambiguation. Since our focus is on ambiguous entities, we discard source DBpedia entities for which there is only one entity with the same name in Wikidata. For example, there are several people with the name “John Burt”: John Burt (footballer), John Burt (rugby union), John Burt (anti-abortion activist), and John Burt (field hockey). Of these, only one choice is correct, which is determined by the owl:sameAs predicate: this provides our positive ground truth label. Thus, by construction, each candidate set contains exactly one positive candidate and at least one negative candidate. This dataset contains 376,065 unique DBpedia URIs comprising the queries with a total of 232,757 unique names, and 967,937 unique Wikidata URIs as candidates.
Wikidata to DBpedia: We can apply exactly the same procedure as above to build a dataset with Wikidata as the source and DBpedia as the target. The resulting dataset contains 329,320 unique Wikidata URIs comprising the queries with a total of 293,712 unique names and 523,517 unique DBpedia URIs as candidates.
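The candidate-set construction described above can be sketched as follows. This is an illustrative reduction, not the authors' pipeline: the function and variable names are ours, and the inverted index is simplified to a plain in-memory dictionary mapping names to URIs.

```python
from collections import defaultdict

def build_name_index(entities):
    """Map each name to the set of entity URIs bearing it.

    `entities` is an iterable of (uri, name) pairs, e.g. extracted from
    foaf:name triples (DBpedia) or rdfs:label triples (Wikidata).
    """
    index = defaultdict(set)
    for uri, name in entities:
        index[name].add(uri)
    return index

def candidate_sets(source_entities, target_index):
    """Yield (source_uri, candidates) pairs for ambiguous cases only.

    A source entity is kept only if the target knowledge graph contains
    at least two entities with the same name, mirroring the restriction
    to ambiguous candidate sets described above.
    """
    for uri, name in source_entities:
        candidates = target_index.get(name, set())
        if len(candidates) >= 2:
            yield uri, candidates
```

In practice the name lookup would be served by an inverted index (e.g. a search engine) rather than a dictionary, but the filtering logic is the same.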
Finally, we shuffle and split the data into training, validation, and test sets with a ratio of 70%, 10%, and 20%, respectively. Statistics are summarized in Table 1, and we make these datasets publicly available at https://github.com/MichaelAzmy/ambiguous-dbwd. Figure 1
shows the number of query (source) entities with different numbers of candidate (target) entities. We see Zipf-like distributions: although most query entities have only a modest number of candidates, there exist outliers with hundreds or more candidates.
Note that by construction, our datasets for evaluating entity matching cannot be solved by NLP techniques based on text alone, since the source entities and target entities share exactly the same name (thus string matching conveys no information). The context for disambiguation must come from some non-text source (in our case, captured in graph embeddings).
| Dataset | Train | Validation | Test | Total |
|---|---|---|---|---|
| DB to WD | 263,245 | 37,607 | 75,213 | 376,065 |
| WD to DB | 230,523 | 32,933 | 65,864 | 329,320 |

Table 1: Statistics of the two datasets.
3 Classification Model
We propose a classification approach to tackle the entity matching problem across knowledge graphs. Here, we use a point-wise training strategy: a classifier is trained on each source–target entity pair and the probability of predicting a match is used for candidate ranking.
A graph embedding represents the nodes of a graph in a low-dimensional semantic space while preserving some aspect of its structure. Different graph embedding techniques have been introduced recently to capture different aspects of graph structure. In our case, we need to preserve the structural as well as the semantic features of the nodes (entities) so that semantically-similar nodes are close to each other in the embedding space. For this, we decided to use the RDF2Vec (Ristoski and Paulheim, 2016) graph embedding technique.
In RDF2Vec, the RDF graph is first “unfolded” into a set of sequences of entities with predicates connecting them, which are treated as sentences. This is typically performed using one of two approaches: graph walks or Weisfeiler-Lehman subtree RDF graph kernels. The generated sentences are then used to train a Word2Vec (Mikolov et al., 2013a,b) model. The outcome of this step is a d-dimensional vector for each entity (i.e., node in the knowledge graph).
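A minimal sketch of the graph-walk unfolding, assuming the RDF graph is given as an adjacency map from each entity to its (predicate, object) pairs; the function name and graph representation are our own illustrative choices. A real RDF2Vec pipeline would feed the resulting token sequences to a Word2Vec trainer (e.g. gensim).

```python
import random

def random_walks(graph, num_walks, depth, seed=0):
    """Unfold an RDF graph into entity/predicate token sequences.

    `graph` maps each entity URI to a list of (predicate, object) pairs.
    Each walk alternates entity and predicate tokens, as in the
    graph-walk variant of RDF2Vec; the resulting "sentences" serve as
    Word2Vec training input.
    """
    rng = random.Random(seed)
    sentences = []
    for start in graph:
        for _ in range(num_walks):
            walk, node = [start], start
            for _ in range(depth):
                edges = graph.get(node)
                if not edges:          # dead end: stop this walk early
                    break
                predicate, node = rng.choice(edges)
                walk.extend([predicate, node])
            sentences.append(walk)
    return sentences
```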
After the above process, the embeddings of the query entity in the source knowledge graph and of the candidate entity in the target knowledge graph are concatenated into one feature vector and fed into a multi-layer perceptron with one hidden layer using the ReLU activation function, followed by a fully-connected layer and softmax to output the final prediction. The model is trained with negative log-likelihood loss using the Adam optimizer (Kingma and Ba, 2014). Each training pair is associated with a ground-truth label from the datasets described in the previous section. For evaluation, we rank the candidates by their match probability. As a baseline, we compare our MLP with a simple logistic regression (LR) model over the same input vectors.
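The classifier's forward pass can be sketched in plain NumPy as follows. This is an illustrative, untrained sketch under our own naming and shape conventions, not the authors' implementation; in practice the weights would be learned with Adam against negative log-likelihood loss.

```python
import numpy as np

def mlp_match_probability(src_vec, tgt_vec, W1, b1, W2, b2):
    """Score one (source, candidate) entity pair.

    Concatenates the two entity embeddings, applies one ReLU hidden
    layer, then a fully-connected layer and softmax over the two
    classes {no-match, match}; returns P(match), used to rank
    candidates.
    """
    x = np.concatenate([src_vec, tgt_vec])   # one feature vector
    h = np.maximum(0.0, W1 @ x + b1)         # hidden layer, ReLU
    logits = W2 @ h + b2                     # two output classes
    z = np.exp(logits - logits.max())        # numerically stable softmax
    probs = z / z.sum()
    return probs[1]
```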
4 Experimental Evaluation
We use the datasets introduced in Section 2 to evaluate our model. For each query, there is one positive candidate and several negative candidates. Entities in DBpedia and Wikidata are embedded independently, and thus matching entities have two different embeddings, one in each knowledge graph. We use pretrained embeddings (from http://data.dws.informatik.uni-mannheim.de/rdf2vec/), trained with the skip-gram model under the graph-walk configuration reported to give good results in Ristoski and Paulheim (2016). If an entity has no pretrained embedding, a randomly-initialized vector is used.
We evaluate the model on the test set using the best configuration tuned on the validation set, training on the entire training set. The MLP hidden layer has size 750; the learning rate and batch size were likewise tuned on the validation set. Mean reciprocal rank (MRR) is used for evaluation.
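MRR itself is straightforward to compute; a sketch, assuming each query's candidate list comes pre-sorted by predicted match probability (descending) with exactly one candidate marked True as the owl:sameAs match:

```python
def mean_reciprocal_rank(rankings):
    """Average of 1/rank of the positive candidate over all queries.

    `rankings` is a list of boolean candidate lists, each sorted by
    predicted match probability; True marks the single correct match.
    """
    total = 0.0
    for ranked in rankings:
        rank = 1 + ranked.index(True)  # 1-based rank of the positive
        total += 1.0 / rank
    return total / len(rankings)
```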
Overall, on the test set, the MLP achieves 0.85 MRR matching DBpedia entities to Wikidata and 0.81 MRR matching Wikidata entities to DBpedia. In comparison, logistic regression fails to learn a good decision boundary and achieves only 0.64 MRR and 0.62 MRR, respectively. Note that the embeddings of each knowledge graph are learned separately, which means that our model is not simply learning to match shared vocabulary, but is actually learning correspondences between two distinct semantic spaces. For reference, a random guessing baseline yields 0.25 MRR and 0.32 MRR, respectively.
Figure 2 breaks down MRR according to entity type. We observe that the types Album and MusicalWork yield worse results than the others, primarily because of greater ambiguity. These two types have larger candidate sets, averaging 12.1 and 11.0 candidates respectively, compared to persons, whose candidate sets average only 6.5 for the DBpedia-to-Wikidata dataset.
Based on error analysis, we observe that in some cases our model lacks the fine-grained ability to disambiguate entities of the same type/domain. For example, in the music domain, our model cannot differentiate between the record company Sunday Best and the single “Sunday Best” by Megan Washington. Overall, in the validation set of the DBpedia-to-Wikidata dataset, for cases where the model fails to place the correct entity at rank one but succeeds at rank two, 79.5% have an entity of the same type in both positions.
In our next analysis, we investigate the effect of candidate-set size on matching accuracy. We measure the MRR of the MLP model on the validation sets, broken down into buckets according to the number of candidates, summarized as boxplots. This is shown in Figure 3: as expected, MRR decreases overall as the number of candidates increases.
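The per-bucket breakdown amounts to grouping per-query reciprocal ranks by candidate-set size; a sketch, with names and input representation of our own choosing:

```python
from collections import defaultdict

def reciprocal_ranks_by_bucket(queries):
    """Group per-query reciprocal ranks by candidate-set size.

    `queries` is an iterable of (num_candidates, rank_of_positive)
    pairs; the returned mapping from bucket to reciprocal ranks can be
    summarized as one boxplot per bucket.
    """
    buckets = defaultdict(list)
    for num_candidates, rank in queries:
        buckets[num_candidates].append(1.0 / rank)
    return dict(buckets)
```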
Finally, we examine the effects of training data size. Figure 4 shows MRR on the validation sets as the size of the training set is varied, expressed as a percentage of the full training set. In each case, we randomly divide the sampled data into training/validation splits while preserving the 70:10 ratio, and repeat the sampling and model runs multiple times; 95% confidence intervals are shown in the plot as shaded regions. We observe that with only a small fraction of the training data, the MLP model already achieves MRR close to what it obtains on the entire training set, for both the DBpedia-to-Wikidata and Wikidata-to-DBpedia datasets.
5 Related Work
Research related to knowledge graph integration comes from the database community and focuses on matching records across databases, a problem referred to as record linkage, entity resolution, or deduplication. Examples include Magellan (Konda et al., 2016), DeepER (Ebraheem et al., 2018), and the work of Mudgal et al. (2018). The primary difference between this work and ours is that they assume relational structure and that the tables to be matched have already been aligned using schema matching techniques. These systems cannot be directly applied to entity matching across knowledge graphs due to differences in structure between the relational model and the RDF model.
The Semantic Web community has studied the problem of matching entities across knowledge graphs, for example, the Ontology Alignment Evaluation Initiative (OAEI) on ontology matching in knowledge graphs. However, the benchmarks used in these evaluations are quite small. For example, the SPIMBENCH benchmark (Saveta et al., 2015) has a total of only 1,800 instances and 50,000 triples.
According to Castano et al. (2011), entity matching on knowledge graphs can be classified into: (1) value-oriented approaches, which define the similarity between instances at the attribute level and select a matching technique based on attribute type, and (2) record-oriented approaches, which include learning-based, similarity-based, rule-based, and context-based techniques.
The best approaches in OAEI 2017 either rely on logical reasoning as in Jiménez-Ruiz and Grau (2011) or on textual features as in Achichi et al. (2017). In contrast, our work differs in the following ways: (1) we use graph embeddings to capture the semantics and structure of the knowledge graphs without the need for hand-crafted features, (2) our system does not require any schema mappings, and (3) our approach can take advantage of the graph nature of RDF, including knowledge about connectivity between nodes and how they relate to one another.
6 Conclusions

We explore the problem of entity matching across knowledge graphs, sharing with the community two benchmark datasets and a baseline model. Although quite simple, our model reveals some insights about the nature of this problem and paves the way for future work.
Acknowledgments

This research was supported by the Natural Sciences and Engineering Research Council (NSERC) of Canada.
- Achichi et al. (2017) Manel Achichi, Zohra Bellahsene, and Konstantin Todorov. 2017. Legato: results for OAEI 2017. In Proceedings of the Twelfth International Workshop on Ontology Matching.
- Auer et al. (2007) Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary Ives. 2007. DBpedia: a nucleus for a web of open data. In The Semantic Web, pages 722–735. Springer.
- Bollacker et al. (2008) Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. 2008. Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, pages 1247–1250, Vancouver, British Columbia, Canada.
- Castano et al. (2011) Silvana Castano, Alfio Ferrara, Stefano Montanelli, and Gaia Varese. 2011. Ontology and instance matching. In Knowledge-Driven Multimedia Information Extraction and Ontology Evolution, pages 167–195. Springer.
- Dalton et al. (2014) Jeffrey Dalton, Laura Dietz, and James Allan. 2014. Entity query feature expansion using knowledge base links. In Proceedings of the 37th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2014), pages 365–374, Gold Coast, Australia.
- Ebraheem et al. (2018) Muhammad Ebraheem, Saravanan Thirumuruganathan, Shafiq Joty, Mourad Ouzzani, and Nan Tang. 2018. Distributed representations of tuples for entity resolution. Proceedings of the VLDB Endowment, 11(11):1454–1467.
- Jiménez-Ruiz and Grau (2011) Ernesto Jiménez-Ruiz and Bernardo Cuenca Grau. 2011. LogMap: logic-based and scalable ontology matching. In Proceedings of the 10th International Semantic Web Conference, pages 273–288, Bonn, Germany.
- Kingma and Ba (2014) Diederik P. Kingma and Jimmy Ba. 2014. Adam: a method for stochastic optimization. arXiv:1412.6980.
- Konda et al. (2016) Pradap Konda, Sanjib Das, Paul Suganthan GC, AnHai Doan, Adel Ardalan, Jeffrey R. Ballard, Han Li, Fatemah Panahi, Haojun Zhang, Jeff Naughton, et al. 2016. Magellan: toward building entity matching management systems. Proceedings of the VLDB Endowment, 9(12):1197–1208.
- Mikolov et al. (2013a) Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. arXiv:1301.3781v3.
- Mikolov et al. (2013b) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.
- Mohammed et al. (2018) Salman Mohammed, Peng Shi, and Jimmy Lin. 2018. Strong baselines for simple question answering over knowledge graphs with and without neural networks. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 291–296, New Orleans, Louisiana.
- Mudgal et al. (2018) Sidharth Mudgal, Han Li, Theodoros Rekatsinas, AnHai Doan, Youngchoon Park, Ganesh Krishnan, Rohit Deep, Esteban Arcaute, and Vijay Raghavendra. 2018. Deep learning for entity matching: a design space exploration. In Proceedings of the 2018 ACM SIGMOD International Conference on Management of Data, pages 19–34, Houston, Texas.
- Ristoski and Paulheim (2016) Petar Ristoski and Heiko Paulheim. 2016. RDF2Vec: RDF graph embeddings for data mining. In Proceedings of the 15th International Semantic Web Conference, pages 498–514, Kobe, Japan.
- Saveta et al. (2015) Tzanina Saveta, Evangelia Daskalaki, Giorgos Flouris, Irini Fundulaki, Melanie Herschel, and Axel-Cyrille Ngonga Ngomo. 2015. Pushing the limits of instance matching systems: a semantics-aware benchmark for linked data. In Proceedings of the 24th International Conference on World Wide Web, pages 105–106, Florence, Italy.
- Suchanek et al. (2007) Fabian M. Suchanek, Gjergji Kasneci, and Gerhard Weikum. 2007. YAGO: a core of semantic knowledge unifying WordNet and Wikipedia. In Proceedings of the 16th International Conference on World Wide Web, pages 697–706, Banff, Alberta, Canada.
- Vrandečić and Krötzsch (2014) Denny Vrandečić and Markus Krötzsch. 2014. Wikidata: a free collaborative knowledgebase. Communications of the ACM, 57(10):78–85.