Enriching Linked Datasets with New Object Properties

Although several RDF knowledge bases are available through the LOD initiative, the ontology schema of such linked datasets is not very rich. In particular, they lack object properties. The problem of finding new object properties (and their instances) between any two given classes has not been investigated in detail in the context of Linked Data. In this paper, we present DART (Detecting Arbitrary Relations for enriching T-Boxes of Linked Data) - an unsupervised solution to enrich the LOD cloud with new object properties between two given classes. DART exploits contextual similarity to identify text patterns from the web corpus that can potentially represent relations between individuals. These text patterns are then clustered by means of paraphrase detection to capture the object properties between the two given LOD classes. DART also performs fully automated mapping of the discovered relations to the properties in the linked dataset. This serves many purposes, such as identification of completely new relations, elimination of irrelevant relations, and generation of prospective property axioms. We have empirically evaluated our approach on several pairs of classes and found that the system can indeed be used for enriching the linked datasets with new object properties and their instances. We compared DART with the newOntExt system, which is an offshoot of the NELL (Never-Ending Language Learning) effort. Our experiments reveal that DART gives better results than newOntExt with respect to both the correctness and the number of relations.


1. Introduction

The Linked Data initiative provides a set of guidelines and best practices for publishing structured data and representing attribute values and relations among a set of entities. The Linking Open Data (LOD) community project (http://www.w3.org/wiki/SweoIG/TaskForces/CommunityProjects/LinkingOpenData) works with the main objective of publishing open datasets as RDF triples and establishing RDF links between entities (aka objects) from different datasets. LOD complements the World Wide Web with a data space of entities connected to one another with labelled edges, which represent the relations among entity pairs (or entities and literal values). Many organizations have built systems to exploit the power of Linked Data for specific purposes. For example, the British Broadcasting Corporation (BBC) uses linked datasets such as DBpedia (Lehmann et al., 2015) to enable cross-domain navigation and enhanced search (https://www.w3.org/2001/sw/sweo/public/UseCases/BBC/) in their websites. IBM has been using Linked Data as an integration technology for several years and their new cognitive system, Watson, has DBpedia and YAGO (Suchanek et al., 2007) as part of its major data sources (Ferrucci et al., 2010).

Currently, most linked datasets are rich in A-Box assertions but poor in T-Box information, i.e., they have a very weak ontology schema. They especially lack object properties. For example, the linked dataset YAGO has 488,469 classes (Mahdisoltani et al., 2015). Among such a huge number of classes, surprisingly there are only 32 object properties (http://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/yago-naga/yago/statistics/; there are 60 object properties in total, but 28 of them connect the domain class to the class http://dbpedia.org/class/yago/YagoLiteral), and hence looking for more object properties to connect these classes becomes an interesting task. Adding more object properties to the ontology schema will help in enriching the domain being represented in the linked dataset. Question answering systems can make use of these additional relations to answer a larger number and a wider range of questions. To realize the full potential of Linked Data in various applications, it is important to enrich LOD with as many appropriate ontological axioms and assertions as possible.

Most of the Linked Data enrichment works (surveyed in (Paulheim, 2017)) focus on adding more instances to existing object properties (in this paper, the term ‘relation’ is used as a synonym of ‘object property’). There are not many techniques available in the literature that identify new relations, given two LOD classes.

The systems proposed in ((Mohamed et al., 2011), (Barchi and Hruschka, 2014)) for the purpose of extending the NELL ontology, OntExt and newOntExt respectively, can be adapted to the Linked Data setting to discover new object properties between given LOD classes. However, we found two issues with how they work. First, newOntExt tends to miss important relations. It represents relations as text patterns and clusters the patterns based on how frequently they co-occur with a pair of entities in a text corpus. The system tends to group dissimilar patterns into the same cluster, and only the representative relation of each cluster is output as a newly discovered relation. For example, given the classes athletes and sportsleagues as inputs, newOntExt places the relations “doesn’t play at” (currently not playing) and “wants to play at” (wish to play) in the same cluster (Navarro, 2016) because these two relations occur between the same subject-object pairs with a high frequency. Hence, only one of them gets selected as the cluster’s representative relation, though both of them are correct relations with different meanings. Second, newOntExt does not do any contextual check to see if a pattern actually fits the context of the given two classes. For example, between the classes Languages and Countries, an incorrect pattern “are people living in” is obtained from the web corpus because “Chinese” can refer to both the language and the ethnic group, and newOntExt has no means of eliminating such a pattern.

In this paper, we present DART (Detecting Arbitrary Relations for enriching T-Boxes of Linked Data), which adopts an unsupervised approach in order to discover and add new object properties and their instances to a linked dataset. DART exploits contextual similarity tools and paraphrase detection in order to identify the set of text patterns which are most likely to be useful as object properties between the two given LOD classes. Additionally, it grounds the relations to the linked dataset in order to identify the completely new relations and is also capable of generating candidate property axioms. By grounding, we mean mapping of discovered relations to existing LOD object properties.

To summarize, our contributions include the following:

  1. Given two classes belonging to a linked dataset, the proposed system DART discovers relations between them by exploiting text patterns from the web corpus, hence enriching the T-Box of the linked dataset. For example, given the two classes Religions (http://dbpedia.org/class/yago/Religion105946687) and Countries (http://dbpedia.org/class/yago/Country108544813), DART generates relations such as “became the official religion in”, “is the predominant religion in” etc.

  2. For each generated relation, a set of paraphrases is also generated that can be used to extract additional instances of the relation.

  3. DART produces instances of the newly generated relations, leading to the enrichment of the A-Box. Continuing with the above example, it can add triples of the form (Hinduism, became the official religion in, Nepal), (Christianity, is the predominant religion in, Australia) etc.

  4. A completely automated technique for grounding the generated relations in the linked dataset has been proposed, which also suggests T-Box axioms for the newly generated relations. For example, in the case of Empires (http://dbpedia.org/class/yago/Empire108557482) and Rulers (http://dbpedia.org/class/yago/Ruler110541229), DART infers that the newly generated relation “was ruler of” might be a sub-property of the YAGO property “isLeaderOf”.

  5. Through the process of grounding, DART also eliminates irrelevant and ambiguous relations.

Our experiments show that DART gives much better results than newOntExt in terms of both precision and recall on input classes belonging to different domains. DART is also capable of suggesting insightful property axioms.


The rest of the paper is structured as follows: Section 2 describes the related works from the literature. Section 3 gives an account of the working of DART with each phase of the approach explained in detail. The experiments conducted by us in order to evaluate the effectiveness of the approach are presented in Section 4 along with the comparison of DART with newOntExt. Conclusions drawn from the work are given in Section 5.

2. Related Works

Relation enrichment (of those other than the owl:sameAs links) of the linked datasets for the purpose of the overall growth of the LOD cloud has been the major focus in many recent works (surveyed in (Paulheim, 2017)). Most of the relation enrichment approaches surveyed in (Paulheim, 2017) focus on extracting more instances (subject-object pairs) of existing relations in the linked datasets. Works such as ((Muñoz et al., 2014), (Muñoz et al., 2013)), (Ritze et al., 2015), ((Syed et al., 2010), (Mulwad et al., 2010a), (Mulwad, 2010) and (Mulwad et al., 2010b)) and (Limaye et al., 2010) use the technique of interpreting web tables for this purpose and a few other works such as (Suchanek et al., 2009) and (Krause et al., 2012a) propose using various semi-supervised approaches for the same. Distant supervision is another new paradigm which has been recently adopted by many works ((Krause et al., 2012b), (Mintz et al., 2009), (Aprosio et al., 2013), (Assis and Casanova, 2014), (Nguyen and Moschitti, 2011)) in order to extract more instances of existing relations. Distant supervision is the technique of utilizing a large number of known facts (from a huge linked dataset such as Freebase) for automatically labeling mentions of these facts in an unannotated text corpus, hence generating training data. A classifier is learnt based on this weakly labeled training data in order to classify unseen instances (Krause et al., 2012b).

Apart from enriching the datasets with additional instances of already existing relations, two other less-explored problems of relation enrichment are: (1) finding instances of specified new relations and (2) discovering arbitrary new relations.

By instances of specified new relations, we mean that the relation is not present in the dataset currently but the name of the relation is given to the system and the system needs to add instances of such a relation to the dataset. The technique proposed in (Jain et al., 2012) to detect instances of the “part-of” (partonomy) relation between linked data instances falls under this category. Similarly, the SILK link discovery framework (Bizer et al., 2009), which is primarily used to detect owl:sameAs links, is also capable of detecting instances of user-specified relations. It uses its own declarative language, Silk - Link Specification Language (Silk-LSL), in order to specify the two datasets between which the links ought to be found and to give the link type. The second problem, discovering arbitrary new relations, can be defined as the task of finding any or all possible relations between two given classes. We find that it has not been tackled by many works. It is precisely this problem we address in this paper. It is to be noted that the system is not aware of the possible relations between the concerned classes beforehand and hence such relations are termed arbitrary relations (as defined in (Etzioni et al., 2011)). There are two systems, OntExt (Mohamed et al., 2011) and newOntExt ((Barchi and Hruschka, 2014), (Barchi and Hruschka, 2015)), which have been proposed in the context of helping NELL to extend its ontology by means of discovering new relations between the ontology classes. They are described below:

OntExt: Given two noun categories ((Carlson et al., 2010) refers to classes as noun categories) and their instances, OntExt discovers relations between them by exploiting the notion that similar patterns occur between the same subject-object pairs. For example, if the phrases “Ganges flows through Allahabad” and “Ganges in the heart of Allahabad” occur in the web corpus with a very high frequency, then this can be taken as an indicator that the patterns “flows through” and “in the heart of” are similar to each other. When such evidence is shown by a large number of subject-object pairs, OntExt gives a very high similarity score between the two patterns. In general, OntExt works in the following manner: given a pair of categories and a set of sentences, each containing a pair of instances known to belong to the given categories, OntExt collects the words in between the instances from each sentence and calls these words a “context-pattern”. Then it builds a context-pattern by context-pattern co-occurrence matrix based on the frequencies of occurrence of these context-patterns with the same subject-object instance pairs. For example, in the above case of finding relations between Rivers and Cities, if the pair “Ganges” and “Allahabad” occurs with the context-pattern “flows through” with a frequency $f_1$ and the pair occurs with the pattern “in the heart of” with a frequency $f_2$, then the matrix entry corresponding to these two context-patterns will be given a value of $f_1 + f_2$. In case there is another subject-object pair (for example, Thames and London) occurring with both these context-patterns with frequencies $f_3$ and $f_4$ respectively, then the matrix cell value becomes $f_1 + f_2 + f_3 + f_4$. K-means clustering is applied on the normalized matrix to group the related context-patterns together. The centroid of each cluster is proposed as a new relation. Then the subject-object pairs are ranked based on how often they occur along with each context-pattern using the formula in equation (1). Finally, the top 50 subject-object pairs are given as seed instances of the new relation to NELL (Carlson et al., 2010).
The weight of a (subject, object) pair $s$ is

\[
W(s) = \sum_{c \in \mathit{cluster}} \frac{Occ(c, s)}{1 + sd(c)} \tag{1}
\]

where
$\mathit{cluster}$ is the cluster of pattern contexts for the given new relation,
$Occ(c, s)$ is the number of times instance $s$ co-occurs with the context pattern $c$,
$sd(c)$ is the standard deviation of the context pattern $c$ from the centroid of the pattern cluster.
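
As a rough illustration of the OntExt bookkeeping described above, the following Python sketch builds the pattern co-occurrence counts and scores a subject-object pair. It assumes that every subject-object pair shared by two patterns adds the sum of its two pattern frequencies to the corresponding cell, and that the weight is a sum of Occ(c, s) / (1 + sd(c)) over the cluster. Both are assumptions of this sketch, and the function names are hypothetical rather than taken from the OntExt code.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence_matrix(occ):
    """Build pattern-by-pattern co-occurrence counts.

    occ maps (pattern, pair) -> frequency of the pattern with that
    subject-object pair. Assumption of this sketch: each pair shared by
    two patterns contributes the sum of its two frequencies to the cell.
    """
    by_pair = defaultdict(dict)
    for (pattern, pair), freq in occ.items():
        by_pair[pair][pattern] = freq

    matrix = defaultdict(int)
    for patterns in by_pair.values():
        # Sort so that each unordered pattern pair gets a single key.
        for a, b in combinations(sorted(patterns), 2):
            matrix[(a, b)] += patterns[a] + patterns[b]
    return dict(matrix)


def pair_weight(pair, cluster, occ, sd):
    """Weight of a subject-object pair for one cluster of patterns.

    sd maps each pattern to its deviation from the cluster centroid
    (supplied by the caller; computing it is outside this sketch).
    """
    return sum(occ.get((c, pair), 0) / (1.0 + sd[c]) for c in cluster)
```

Pairs can then be sorted by `pair_weight` in descending order to pick the seed instances of a proposed relation.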


As more than half of the relations generated by OntExt were invalid (determined manually in (Mohamed et al., 2011)), the authors of OntExt have proposed a classifier which can differentiate between valid and invalid relations to some extent.

newOntExt: newOntExt which was developed based on OntExt had a few changes in its working (Barchi and Hruschka, 2014): instead of considering all the words in between the two input instances as a pattern, newOntExt used ReVerb (Etzioni et al., 2011) for extracting the patterns in order to reduce the number of noisy patterns obtained; for optimising the computational cost, a more elegant file structure was used for searching through the sentences; instead of considering every pair of categories as input to this system, reduced category groups of interest were formed to pick the input category pairs.

A major difference between DART and newOntExt is that the latter takes co-occurrence values of the patterns to be an indicator of the semantic similarity between them whereas DART computes the semantic similarity by means of paraphrase detection techniques. It should be noted that DART does not rely upon the lexical similarity of the patterns, i.e., DART can detect the semantic similarity even if the two patterns have disjoint sets of words. In addition to this, DART also performs grounding of relations and generation of candidate property axioms. Comparison of DART with newOntExt is described in Section 4.1.

3. Working of DART

3.1. Pre-processing

Given two classes D1 and D2, we need patterns occurring in the web corpus along with the instances of D1 and D2, in order to discover the possible relations between them. For this purpose, we obtain (subject, predicate, object) triples, known as a triple corpus C, from the RCE 1.1 file (ReVerb ClueWeb Extractions 1.1, a dataset consisting of 15 million triples produced by running ReVerb on the English portion of the ClueWeb09 corpus), such that the subject and object belong to D1 (D2) and D2 (D1) respectively. We have used the RCE dataset in our experiments as newOntExt employs ReVerb and we wished to maintain a uniform set of inputs for both DART and newOntExt for a fair comparison. However, this step can also be replaced in the following manner: use a web corpus such as ClueWeb and extract sentences containing instances of D1 and D2; then apply any triplification tool such as ClausIE (Del Corro and Gemulla, 2013), Ollie (Mausam et al., 2012) etc. to obtain the input triple corpus C.

We also store the direction of these triples in C, i.e., if the subject of the triple belongs to D1 and the object belongs to D2, then the direction is marked as “forward”. If the subject belongs to D2 and the object belongs to D1, the direction is marked as “reverse”.
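
The pre-processing step can be pictured with a small Python sketch. It assumes the extracted triples are available as plain (subject, predicate, object) tuples and that the instances of D1 and D2 are given as sets of strings; the names `DirectedTriple` and `build_corpus` are hypothetical and not part of DART.

```python
from typing import Iterable, List, NamedTuple, Set, Tuple


class DirectedTriple(NamedTuple):
    subject: str
    predicate: str
    obj: str
    direction: str  # "forward" or "reverse"


def build_corpus(
    triples: Iterable[Tuple[str, str, str]],
    d1_instances: Set[str],
    d2_instances: Set[str],
) -> List[DirectedTriple]:
    """Keep only triples linking D1 and D2, and record their direction."""
    corpus = []
    for s, p, o in triples:
        if s in d1_instances and o in d2_instances:
            corpus.append(DirectedTriple(s, p, o, "forward"))
        elif s in d2_instances and o in d1_instances:
            corpus.append(DirectedTriple(s, p, o, "reverse"))
        # Triples touching neither class pairing are dropped.
    return corpus
```

When an entity name belongs to both classes, this sketch prefers the forward reading; how such cases should be resolved is left open here.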

3.2. Relation discovery phase

The relation discovery phase, given in Algorithm 1, takes the corpus C as input and outputs clusters of synonymous relations. We collect all the unique predicates in C (let us call them “patterns”) and filter them based on whether they are suitable for the given input domain (Lines 1-7), i.e., a contextual similarity check is performed in the following manner: in each pattern, all the function words (http://www.sequencepublishing.com/academic.html) are removed (as they are not context-specific words) and the remaining words are checked for similarity with the domain name. For example, let us assume that the user intends to find the relations between a set of rivers and a set of cities, and the user-specified domain name is “river”. If the pattern under consideration is “rises in”, DART checks the similarity of “rises” (as the other word “in” is a function word) with “river” and if this similarity crosses a certain threshold (more details on how this threshold was fixed are given in Section 4), DART includes this pattern; otherwise, it discards it. We use the Word2Vec (Mikolov et al., 2013) model proposed and trained by Google (https://code.google.com/archive/p/word2vec/) for finding the contextual similarity. The intuition behind this step is that patterns obtained from the web corpus that are not relevant to the domain can be eliminated by checking if the contexts of the pattern and the domain name are close to each other, i.e., this serves as a pseudo-disambiguation step.

The filtered patterns are then subjected to single pass clustering (Frakes and Baeza-Yates, 1992), which works as follows (Lines 8-33): take each pattern “p” and check its semantic similarity with the representative relations of all the clusters. If the maximum similarity value is less than a fixed threshold value (=0.5), place “p” in a new cluster. Otherwise, place “p” in the cluster whose representative relation has the maximum similarity with it, and recompute the representative relation for this augmented cluster: the representative relation is the pattern which has the maximum average similarity with the other patterns in that cluster.
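
The contextual filter at the start of this phase could look roughly like the Python sketch below. The word-to-word similarity is passed in as a function so that any embedding model can be plugged in (for instance, the `similarity` method of a loaded gensim `KeyedVectors` object). The function-word list is a tiny illustrative subset, and the rule that a pattern is kept when any of its content words reaches the threshold is an assumption of this sketch, since the text does not say how several content words are combined.

```python
from typing import Callable, Iterable, List, Set

# Illustrative subset only; a real run would load a full function-word list.
FUNCTION_WORDS: Set[str] = {
    "a", "an", "the", "in", "of", "is", "are", "at", "to", "by", "for", "on",
}


def filter_patterns(
    patterns: Iterable[str],
    domain_name: str,
    word_similarity: Callable[[str, str], float],
    threshold: float = 0.2,
    function_words: Set[str] = FUNCTION_WORDS,
) -> List[str]:
    """Keep patterns whose content words are close to the domain name."""
    kept = []
    for pattern in patterns:
        content = [w for w in pattern.lower().split() if w not in function_words]
        # Patterns made only of function words have no content to check
        # and are dropped by this sketch.
        if any(word_similarity(w, domain_name) >= threshold for w in content):
            kept.append(pattern)
    return kept
```

The default of 0.2 mirrors the threshold reported in Section 4; it is exposed as a parameter because the text notes that users may vary it.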

In order to determine the semantic similarity between two patterns, we modified the paraphrase detection technique proposed by Mihalcea et al. (Mihalcea et al., 2006): We have eliminated the word specificity weights. In (Mihalcea et al., 2006), the individual word-to-word similarity values were weighted using a word specificity measure so that higher importance can be given to a semantic matching identified between two specific words such as “collie” and “sheepdog” when compared to a matching identified between words such as “get” and “become”. In the context of DART, words such as “get” and “become” (any verb in general) have a good chance of occurring in the input patterns as the aim of DART is to extract relations between classes. Hence, giving a low weight to such words (as done in (Mihalcea et al., 2006)) is not appropriate in the context of DART. The formula used to determine similarity of patterns in our work is given in equation (2).

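\[
sim(T_1, T_2) = \frac{1}{2} \left( \frac{\sum_{w \in T_1} maxSim(w, T_2)}{|T_1|} + \frac{\sum_{w \in T_2} maxSim(w, T_1)}{|T_2|} \right) \tag{2}
\]

where,
$T_1$ and $T_2$ represent the input text segments,
$maxSim(w, T_i)$ refers to the similarity value of the word in $T_i$ which is most similar to the word $w$ in the other text segment,
$|T_i|$ refers to the number of words in $T_i$

In our implementation, the threshold value chosen to consider two segments $T_1$ and $T_2$ to be similar is 0.5 (adopted from (Mihalcea et al., 2006)).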
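
A minimal Python sketch of the pattern similarity follows, reading equation (2) as the measure of Mihalcea et al. with the word specificity weights dropped, so that each direction is a plain average of best word matches. That reading, whitespace tokenization, and the name `pattern_similarity` are assumptions of this sketch; `word_sim` stands in for whatever word-to-word measure is used.

```python
from typing import Callable, List


def pattern_similarity(
    t1: str, t2: str, word_sim: Callable[[str, str], float]
) -> float:
    """Symmetric, unweighted similarity between two text patterns."""
    w1: List[str] = t1.lower().split()
    w2: List[str] = t2.lower().split()
    if not w1 or not w2:
        return 0.0

    def directed(src: List[str], dst: List[str]) -> float:
        # For every word in src, take its best match in dst, then average.
        return sum(max(word_sim(w, v) for v in dst) for w in src) / len(src)

    return 0.5 * (directed(w1, w2) + directed(w2, w1))
```

With an exact-match word similarity, “is ceo of” and “is the ceo of” score 0.875: every word of the first pattern finds a match, while three of the four words of the second do.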

LESK (Banerjee and Pedersen, 2002) has been used to perform the word-to-word similarity component of equation (2), as it works for all combinations of parts of speech. The representative relations of the clusters obtained at the end of this phase form the relations between the two given classes.
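
The single pass clustering described in this section can be sketched in Python as below. The pattern similarity is injected as a function, the 0.5 threshold follows the text, and ties between candidate representatives are broken by insertion order, which is a choice of this sketch. The function names are hypothetical.

```python
from typing import Callable, List, Tuple

Sim = Callable[[str, str], float]


def representative(cluster: List[str], sim: Sim) -> str:
    """Pattern with the highest average similarity to the other members."""
    if len(cluster) == 1:
        return cluster[0]

    def avg_sim(i: int) -> float:
        others = [sim(cluster[i], q) for j, q in enumerate(cluster) if j != i]
        return sum(others) / len(others)

    return cluster[max(range(len(cluster)), key=avg_sim)]


def single_pass_cluster(
    patterns: List[str], sim: Sim, threshold: float = 0.5
) -> List[Tuple[str, List[str]]]:
    """Return (representative, members) for each cluster, in creation order."""
    clusters: List[List[str]] = []
    reps: List[str] = []
    for p in patterns:
        best_idx, best_sim = -1, -1.0
        for i, rep in enumerate(reps):
            s = sim(p, rep)
            if s > best_sim:
                best_idx, best_sim = i, s
        if best_idx >= 0 and best_sim >= threshold:
            clusters[best_idx].append(p)
            reps[best_idx] = representative(clusters[best_idx], sim)
        else:
            # No existing cluster is similar enough: start a new one.
            clusters.append([p])
            reps.append(p)
    return list(zip(reps, clusters))
```

Each pattern is compared only against the current representatives, so the result depends on the order in which patterns are presented.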

[Algorithm 1: Relation discovery phase]

3.3. Grounding of relations

Once the relation discovery phase generates relations between the two input classes, the system needs to check if these relations can be grounded in the linked dataset, i.e., whether they can be mapped to some existing property in the linked dataset. Only if a relation cannot be grounded (mapped to existing LOD properties) is it added as a new relation, leading to T-Box enrichment of the linked dataset. In order to map a representative relation $r$ to an existing LOD property $p$, DART checks the semantic similarity between $p$ and $r$ (using equation (2), but with an increased threshold of 0.75 as we want to avoid spurious mappings). If the similarity value crosses 0.75, then the similarity between $p$ and every relation in the cluster of $r$ is determined. Finally, if more than 50% of the relations in $r$’s cluster have a similarity value >= 0.75 with $p$, then $r$ is said to be grounded and matched to $p$.
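
The two-stage grounding test can be sketched in Python as follows. The similarity function is passed in, property labels are assumed to be already normalized into comparable text, and treating “crosses 0.75” as greater than or equal to 0.75 is an assumption of this sketch. The name `is_grounded` is hypothetical.

```python
from typing import Callable, List


def is_grounded(
    rep: str,
    cluster: List[str],
    lod_property: str,
    sim: Callable[[str, str], float],
    threshold: float = 0.75,
) -> bool:
    """Check whether a discovered relation maps to an existing property."""
    # Stage 1: the representative relation itself must match the property.
    if sim(rep, lod_property) < threshold:
        return False
    # Stage 2: strictly more than half of the cluster must match as well.
    matching = sum(1 for r in cluster if sim(r, lod_property) >= threshold)
    return matching > 0.5 * len(cluster)
```

A relation for which this returns False against every existing property would be treated as new.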

In order to determine the domain and range of the LOD property $p$ to which the relation $r$ was grounded, we use the ontology of the linked dataset (if the ontology lacks this information, we use the system proposed in (Töpper et al., 2012)). The domain and range of the grounded relation $r$ are the two input classes. These grounded relations are then handled by DART in the following ways: If the grounded relation $r$ and the matched LOD property $p$ have the same domain and range, then we have detected a new equivalent property to $p$. If the domain of $r$ is the range of $p$ and the range of $r$ is the domain of $p$, then $r$ is a new inverse property for $p$. If the domain (and/or range) of $r$ is a subclass of the domain (and/or range) of $p$, then we have discovered a new sub-property for the LOD property $p$. Hence in these cases, we don’t discard the relation completely. More relation instances of $r$ are produced by DART in the Triple-Finding phase (Section 3.4). However, if the domain and range do not match in any of the above-mentioned ways, then we consider the grounded relation $r$ to be an ambiguous or irrelevant (noisy) relation and hence completely discard it. In this paper, we call a relation ambiguous if it holds between two or more pairs of classes. For example, the relation “caused by” is an ambiguous relation as it is a meaningful relation between the classes (Event, Event) as well as between the classes (Disease, Drug).
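
The case analysis on domains and ranges can be sketched in Python as below. The subclass test is supplied by the caller, the class names in the usage are made up for illustration, and the sub-property case is read as “domain and range are each equal to or a subclass of the property’s”, which is one possible interpretation of the “and/or” wording rather than a confirmed rule.

```python
from typing import Callable


def classify_grounded_relation(
    rel_domain: str,
    rel_range: str,
    prop_domain: str,
    prop_range: str,
    is_subclass: Callable[[str, str], bool],
) -> str:
    """Suggest an axiom for a grounded relation, or mark it for discarding."""
    if rel_domain == prop_domain and rel_range == prop_range:
        return "equivalentProperty"
    if rel_domain == prop_range and rel_range == prop_domain:
        return "inverseOf"
    domain_ok = rel_domain == prop_domain or is_subclass(rel_domain, prop_domain)
    range_ok = rel_range == prop_range or is_subclass(rel_range, prop_range)
    if domain_ok and range_ok:
        return "subPropertyOf"
    # No compatible reading: treated as ambiguous or irrelevant.
    return "discard"
```

For example, with a hypothetical hierarchy in which Ruler is a subclass of Person and Empire a subclass of Place, a relation from Ruler to Empire grounded to a property from Person to Place would be suggested as a sub-property.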

Note that there are a few other works in the literature (such as (Färber et al., 2016)) which focus mainly on the grounding of relations in a Knowledge Base (KB). However, the goal of such systems and the goal of DART are very different from each other with respect to grounding: the former kind of systems extract triples from an external source (such as text) and attempt to ground the relations. They retain only the grounded relations and consider the non-grounded relations as irrelevant to the schema and hence discard them. In our system, we attempt to ground the discovered relations in a linked dataset in order to achieve three things: identify irrelevant relations among the grounded relations and discard them; align the remaining grounded relations to the ontology schema and generate prospective axioms such as inverse, subproperty etc.; identify the non-grounded (new) relations and add them to the T-Box. Moreover, in systems such as (Färber et al., 2016), grounding of relations is based on the grounding of entities. We do not use this approach as it will not help us to identify irrelevant and ambiguous relations. Also, the method in (Färber et al., 2016) is semi-automatic, i.e., a human is involved to decide whether “buy” can be mapped to the relation “acquired”. In contrast, DART performs this phase in an automated fashion.

3.4. Triple finding phase

In this phase, we intend to find all triples (s, p, o) where p is a relation found in the previous phase, hence enriching the A-Box of the linked dataset. In order to obtain the instances of a new relation (let us call this “p”), each relation “r” in p’s cluster is looked up in the corpus C, and each subject-object pair (s, o) found in C for the relation “r” is given as an instance of “p”. Hence (s, p, o) becomes the final triple. One thing to be noted here is that if the relation looked up in the corpus (“r”) is of forward direction and the relation “p” is of reverse direction (the notion of directions is explained in Section 3.1), then the final triple given as output becomes (o, p, s).
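
The triple finding step can be illustrated with a short Python sketch. It assumes the corpus is a list of (subject, predicate, object, direction) tuples as produced in pre-processing and that the direction of the new relation is known. Swapping subject and object whenever the two directions differ generalizes the forward/reverse case mentioned in the text and is an assumption of this sketch; `find_triples` is a hypothetical name.

```python
from typing import Iterable, List, Tuple

CorpusTriple = Tuple[str, str, str, str]  # (subject, predicate, object, direction)


def find_triples(
    new_relation: str,
    new_relation_direction: str,
    cluster: Iterable[str],
    corpus: Iterable[CorpusTriple],
) -> List[Tuple[str, str, str]]:
    """Collect instances of new_relation from all patterns in its cluster."""
    members = set(cluster)
    result = []
    for s, r, o, direction in corpus:
        if r not in members:
            continue
        if direction == new_relation_direction:
            result.append((s, new_relation, o))
        else:
            # Directions disagree, so subject and object swap roles.
            result.append((o, new_relation, s))
    return result
```

In the usage implied by the running example, a reverse triple such as (London, lies on, Thames) clustered under the forward relation “flows through” would yield (Thames, flows through, London).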

4. Experiments and Results

The proposed system, DART, has been implemented in Java 1.7 and all experiments have been conducted on a Linux system equipped with an Intel 3.20 GHz quad-core processor and 32 GB main memory. All details regarding the input classes, the relations and relation instances obtained can be found on our project web page (https://sites.google.com/site/ontoworks/projects).

The experiments conducted on the NELL Knowledge Base for the purpose of comparing DART with newOntExt, and the observations made are given in Section 4.1. The details about the experiments held to gauge DART’s performance on LOD classes are given in Section 4.2.

4.1. Comparison with newOntExt

Though the primary aim of DART is to enrich linked datasets such as DBpedia, YAGO etc., in this subsection, we have conducted experiments on collections of entities belonging to the NELL Knowledge-Base (the NELL.08m.1050.esv.csv “every belief in the KB” file, downloaded from http://rtw.ml.cmu.edu/rtw/resources on 26th April 2017) in order to compare DART against the newOntExt system. We have used the implementation of newOntExt provided by the authors of (Barchi and Hruschka, 2014) (https://github.com/MaLL-UFSCar/ontext). The systems have been compared using two measures, accuracy and the number of meaningful (the terms “meaningful” and “correct” have been used synonymously in the paper) relations obtained. Accuracy is taken as the ratio of the correct relations (as determined by human evaluators) to the total number of relations obtained.

For newOntExt, the value of k used in the k-means clustering of patterns (see Section 2) affects the quality of the relations obtained to a large extent. Since the value of k used is not mentioned in ((Barchi and Hruschka, 2014), (Barchi and Hruschka, 2015)) and has been fixed in a dataset-specific manner in (Mohamed et al., 2011), we have applied the Elbow method (Rajaraman and Ullman, 2011) to determine the best k value (from a range of k=3 to k=29) for clustering the patterns, for each experiment. For DART, the threshold used for checking the contextual similarity using Word2Vec has an impact on the quality of the relations obtained. Hence we conducted experiments for each input class pair for 5 different thresholds: 0.1, 0.2, 0.3, 0.5 and 0.7. We observed that the thresholds of 0.3, 0.5 and 0.7 give very meaningful but very few relations. On the other hand, setting the threshold to 0.1 gives a very high number of relations (around 130) but most of them are noisy, irrelevant relations. Therefore we decided to use the threshold of 0.2 uniformly for our experiments in order to maintain a good trade-off between the correctness and the number of relations obtained (however, the user can choose to vary this threshold depending on the requirements of the application). For the evaluation, the relations were presented in this format: <classname> relation <classname> (for example, <rivers> flows through <cities>) and three ontology engineers were assigned to evaluate them on a two-valued scale: correct and incorrect. We required that all three evaluators agree that a relation is correct in order for it to be counted as correct. Table 1 gives details about the input categories and a few sample relations obtained through DART. We have chosen the input categories such that they belong to different domains (Geography, Industries and Medicine) in order to demonstrate the versatility of the proposed system. Also, these particular categories were chosen from their respective domains to ease the process of manual evaluation. Table 2 gives the accuracy and the number of correct relations obtained through DART and newOntExt.
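
The evaluation protocol, in which a relation counts as correct only when all evaluators agree and accuracy is the fraction of correct relations, can be written down as a small Python sketch. The data layout (a mapping from relation to a list of boolean votes) and the name `evaluate` are assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def evaluate(judgements: Dict[str, List[bool]]) -> Tuple[int, float]:
    """Return (number of correct relations, accuracy).

    A relation is correct only if every evaluator marked it correct.
    """
    if not judgements:
        return 0, 0.0
    correct = sum(1 for votes in judgements.values() if all(votes))
    return correct, correct / len(judgements)
```

A single dissenting vote is enough to count a relation as incorrect, which makes the reported accuracy a conservative figure.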

| D1 (size) | D2 (size) | Sample relations through DART |
|---|---|---|
| Rivers (21059), http://rtw.ml.cmu.edu/rtw/kbbrowser/pred:river | Cities (26119), http://rtw.ml.cmu.edu/rtw/kbbrowser/pred:city | “flows through”, “is just a few miles west of”, “drowned in” |
| Languages (11278), http://rtw.ml.cmu.edu/rtw/kbbrowser/pred:language | Countries (3064), http://rtw.ml.cmu.edu/rtw/kbbrowser/pred:country | “are spoken in”, “is a common language in”, “is an official language in” |
| Vegetables (258), http://rtw.ml.cmu.edu/rtw/kbbrowser/pred:vegetable | Diseases (16120), http://rtw.ml.cmu.edu/rtw/kbbrowser/pred:disease | “is good for curing”, “increases the risk for” |
| CEOs (7289), http://rtw.ml.cmu.edu/rtw/kbbrowser/pred:ceo | Companies (41660), http://rtw.ml.cmu.edu/rtw/kbbrowser/pred:company | “is ceo of”, “is a founder of”, “is a company established by” |

Table 1. Input categories from the NELL KB and sample relations through DART

| Input categories | DART: no. of correct relations | DART: accuracy | newOntExt (with best k-value): no. of correct relations | newOntExt (with best k-value): accuracy |
|---|---|---|---|---|
| Rivers, Cities | 15 | 0.42 | 4 | 0.15 |
| Languages, Countries | 22 | 0.63 | 7 | 0.54 |
| Vegetables, Diseases | 22 | 0.88 | 16 | 0.84 |
| CEOs, Companies | 19 | 0.86 | 11 | 0.58 |

Table 2. Evaluation results: accuracy and number of meaningful relations obtained

From Table 2 we can see that DART performs better than newOntExt both as a recall-oriented system and as a precision-oriented system. Since clustering of patterns in newOntExt is based on co-occurrence values, dissimilar (but meaningful) patterns tend to get grouped together and hence many meaningful patterns get lost, leading to a lower number of correct relations from newOntExt. For example, in the experiment conducted on the classes CEO and Company, newOntExt places the patterns “is the ceo of” and “is the founder of” into the same cluster because the two patterns occur between the same set of subject-object pairs. Only one pattern from a cluster gets chosen as the centroid of the cluster and output by newOntExt, and hence the other pattern is dropped though it is a meaningful relation between the given classes. Also, the Word2Vec model used by DART has eliminated irrelevant patterns such as “are people living in” (in the case of Languages and Countries), leading to a better accuracy value for DART.

4.1.1. Grounding in the context of NELL relations

NELL and the LOD follow different conventions for naming relations. In NELL, the domain and/or range names are appended to the actual relation to form the relation name. For example, the relation “flows through”, which holds between the classes Rivers (http://rtw.ml.cmu.edu/rtw/kbbrowser/pred:river) and Cities (http://rtw.ml.cmu.edu/rtw/kbbrowser/pred:city), is named “riverflowsthroughcity” (in LOD, such a relation would be named “flowsThrough”). Similarly, the relation “side effect caused by”, which holds between the classes Physiological Condition and Drugs, is named “sideeffectcausedbydrug” in NELL. The advantage of this naming technique is that every sense of the relation is captured by its name, which leaves no room for ambiguity. Hence there is no need to ground the generated relations in the NELL Knowledge Base. Moreover, the main goal of DART is to enrich the LOD, so we exclude the grounding process from our experiments on the NELL KB. We compare the number of correct relations obtained through DART with those obtained through newOntExt (see Table 2), irrespective of whether they are already present in the NELL KB. This demonstrates the efficacy of DART relative to newOntExt in discovering relations between given classes.
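The two naming conventions can be contrasted with a small sketch. The helper names `nell_relation_name` and `lod_relation_name` are hypothetical, and the normalization rules (lower-casing, dropping non-alphanumeric characters, lowerCamelCase) are assumptions inferred only from the two examples above.

```python
import re

def nell_relation_name(phrase: str, domain: str = "", range_: str = "") -> str:
    """Concatenate the optional domain, the relation phrase and the optional
    range into one lower-case token, in the style of the NELL examples."""
    parts = [domain, phrase, range_]
    return "".join(re.sub(r"[^a-z0-9]", "", part.lower()) for part in parts)

def lod_relation_name(phrase: str) -> str:
    """Render the relation phrase alone in lowerCamelCase, in the style of
    typical LOD property names."""
    words = re.findall(r"[A-Za-z0-9]+", phrase)
    if not words:
        return ""
    return words[0].lower() + "".join(word.capitalize() for word in words[1:])

river_city = nell_relation_name("flows through", domain="river", range_="city")
side_effect = nell_relation_name("side effect caused by", range_="drug")
lod_style = lod_relation_name("flows through")
```

Because the class names are part of the NELL-style name, two relations with the same phrase but different classes receive different names, whereas the LOD-style name alone does not disambiguate them.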

4.1.2. Complexity of DART vs newOntExt

Table 3 gives the details about the time taken by DART and newOntExt for the four experiments.

| Input classes | DART | newOntExt |
| --- | --- | --- |
| Rivers, Cities | 68 | 22.98 |
| Languages, Countries | 429 | 9.5 |
| Vegetables, Diseases | 37 | 6.51 |
| CEOs, Companies | 5 | 6.23 |

Table 3. Time taken (in seconds)

Because newOntExt clusters patterns by co-occurrence whereas DART checks semantic similarity when clustering, DART is inherently more expensive than newOntExt. However, we have attempted to reduce the computational cost in two ways: by employing Word2Vec to filter patterns, and by using single-pass clustering to cluster the patterns (as opposed to clustering algorithms such as k-means, which perform several iterations). For example, in the case of CEOs and Companies, filtering through Word2Vec reduced the number of patterns from 339 to 51. The number of patterns finally subjected to clustering is therefore low, which reduces the running time (in this case to even less than that of newOntExt). In three of the four experiments DART takes between a few seconds and about one minute; the exception is Languages and Countries, where 230 patterns are output by the Word2Vec stage and subjected to clustering. Further optimizing DART is an interesting piece of future work.
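A minimal sketch of single-pass clustering is given below to show why it needs only one sweep over the patterns. It is an illustration under stated assumptions, not the DART implementation: token-level Jaccard overlap stands in for the paraphrase detection step, each pattern is compared only with the first member of every existing cluster, and the threshold of 0.7 is an arbitrary illustrative value.

```python
from typing import Callable, List

def token_jaccard(a: str, b: str) -> float:
    """Stand-in similarity: Jaccard overlap of the two patterns' token sets."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0

def single_pass_cluster(
    patterns: List[str],
    similarity: Callable[[str, str], float],
    threshold: float,
) -> List[List[str]]:
    """Visit each pattern once: join the most similar existing cluster if its
    similarity reaches the threshold, otherwise start a new cluster."""
    clusters: List[List[str]] = []
    for pattern in patterns:
        best_index, best_score = -1, -1.0
        for index, cluster in enumerate(clusters):
            score = similarity(pattern, cluster[0])  # compare with cluster leader
            if score > best_score:
                best_index, best_score = index, score
        if best_index >= 0 and best_score >= threshold:
            clusters[best_index].append(pattern)
        else:
            clusters.append([pattern])
    return clusters

patterns = ["is the ceo of", "is ceo of", "is the founder of", "was founded by"]
clusters = single_pass_cluster(patterns, token_jaccard, threshold=0.7)
```

Each pattern is compared with at most one leader per existing cluster and is never revisited, in contrast to iterative algorithms such as k-means that reassign all points in every round. The result of a single pass depends on the order in which the patterns arrive.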

4.2. Evaluation of DART on linked datasets

In this section, we describe the experiments conducted to demonstrate the enrichment of the LOD through DART; that is, we have chosen classes from linked datasets such as YAGO and DBpedia as our input classes. Table 4 gives the details of the input classes and a few sample relations obtained through DART. Here again, we have chosen these classes from different domains (Geography, Literature, History and Music) to show that our approach is versatile.

| D1 (size) | D2 (size) | Sample relations through DART |
| --- | --- | --- |
| Religions, http://dbpedia.org/class/yago/Religion105946687 (222) | Countries, http://dbpedia.org/class/yago/Country108544813 (5726) | “became the official religion in”, “is the predominant religion in”, “is the fastest growing religion in” |
| Empires, http://dbpedia.org/class/yago/Empire108557482 (325) | Rulers, http://dbpedia.org/class/yago/Ruler110541229 (9118) | “ascended the throne of”, “declared war on”, “inherited the kingdom of”, “is founded by” |
| Writers, http://dbpedia.org/class/yago/Writer110794014 (10000) | Novels, http://dbpedia.org/class/yago/Novel106367879 (10000) | “is written by”, “is a novel by”, “is a biography of”, “is the award winning author of” |
| Music genres, http://dbpedia.org/ontology/MusicGenre (1245) | Music genres | “is a subgenre of”, “is more popular than” |

Table 4. Input classes from the LOD and sample relations through DART

Table 5 lists a few relations that were mapped to LOD properties in each experiment, together with the action performed by DART on the grounded relations. The full list of grounded relations is available on our project web page (footnote 11).

| Input classes | LOD property | Grounded relation | Action taken by DART |
| --- | --- | --- | --- |
| Religions, Countries | isLeaderOf | is the father of | Domain, range not matching: discard |
| Rulers, Empires | isLeaderOf | was ruler of | Domain, range matched through subclass: candidate sub-property |
| Writers, Novels | directed | directed by | Domain, range not matching: discard |
| Music genres, Music genres | musicSubgenre | is a subgenre of | Domain, range match: candidate equivalent property |

Table 5. Grounding of relations
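The three actions listed in Table 5 can be read as a simple decision rule over domains and ranges. The sketch below is one possible reading, not the procedure of Section 3.3: the function names are hypothetical, and the class hierarchy and the domain and range assigned to isLeaderOf (`Leader`, `Territory`) are placeholders chosen for illustration rather than the actual schema of any linked dataset.

```python
from typing import Dict, Set

def is_same_or_subclass(cls: str, target: str, superclasses: Dict[str, Set[str]]) -> bool:
    """True if cls equals target or target is among the known superclasses of cls."""
    return cls == target or target in superclasses.get(cls, set())

def grounding_action(
    rel_domain: str,
    rel_range: str,
    prop_domain: str,
    prop_range: str,
    superclasses: Dict[str, Set[str]],
) -> str:
    """Decide what to do with a relation that was grounded to an LOD property."""
    if rel_domain == prop_domain and rel_range == prop_range:
        return "candidate equivalent property"
    if is_same_or_subclass(rel_domain, prop_domain, superclasses) and \
            is_same_or_subclass(rel_range, prop_range, superclasses):
        return "candidate sub-property"
    return "discard"

# Placeholder hierarchy (assumption): class -> set of its superclasses.
superclasses = {"Ruler": {"Leader"}, "Empire": {"Territory"}}

genre_case = grounding_action("MusicGenre", "MusicGenre", "MusicGenre", "MusicGenre", superclasses)
ruler_case = grounding_action("Ruler", "Empire", "Leader", "Territory", superclasses)
religion_case = grounding_action("Religion", "Country", "Leader", "Territory", superclasses)
```

Under these assumed schemas the three calls reproduce the three kinds of action shown in Table 5: an exact match suggests an equivalent property, a match through subclasses suggests a sub-property, and a mismatch leads to the relation being discarded.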

Table 6 shows the accuracy and the number of correct relations obtained for the input classes in Table 4. As in Section 4.1, three ontology engineers were asked to evaluate the relations manually, and a relation was considered correct only if all three experts agreed that it was correct.

| Input classes | No. of correct relations | Accuracy |
| --- | --- | --- |
| Religions, Countries | 25 | 0.50 |
| Empires, Rulers | 10 | 0.833 |
| Writers, Novels | 15 | 0.52 |
| Music genres, Music genres | 9 | 0.69 |

Table 6. Evaluation results: accuracy and number of meaningful relations obtained
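The evaluation protocol can be sketched as follows, assuming that accuracy is the fraction of output relations judged correct (an assumption that is consistent with the figures in Table 6, for example 25 correct relations at an accuracy of 0.50). The function name and the votes are hypothetical and do not reproduce the experts' actual judgments.

```python
from typing import Dict, Tuple

def unanimous_accuracy(votes: Dict[str, Tuple[bool, bool, bool]]) -> Tuple[int, float]:
    """Count a relation as correct only if all three evaluators accept it,
    and return (number of correct relations, accuracy)."""
    correct = sum(1 for judgments in votes.values() if all(judgments))
    accuracy = correct / len(votes) if votes else 0.0
    return correct, accuracy

# Hypothetical judgments from three evaluators for four output relations.
votes = {
    "relation_1": (True, True, True),
    "relation_2": (True, True, True),
    "relation_3": (True, False, True),   # one dissent: not counted as correct
    "relation_4": (False, False, False),
}
correct, accuracy = unanimous_accuracy(votes)
```

Requiring unanimity is a conservative criterion: a single dissenting evaluator is enough for a relation to be counted as incorrect.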

4.2.1. Value of grounding

Following the grounding technique explained in Section 3.3, DART discarded or retained the grounded relations as appropriate. Note that if the discarded irrelevant relations (such as “is the father of” in the case of Religions and Countries) had been included in the output of DART, its accuracy would have decreased. Hence, the grounding phase improves the performance of DART. The grounding phase also suggests candidate equivalence, sub-property and inverse property axioms between the relations and existing LOD properties. These property axioms can be further validated through techniques that determine their support from the instances (Fleischhacker et al., 2012) and then added to the T-Box. We intend to carry out this validation and enrichment process as part of our future work. DART has not been compared with any property alignment system, such as those surveyed by Gunaratna et al. (2014), since the main goal of DART is only to generate relations between two given classes. DART suggests candidate property axioms that are yet to be supported by evidence from the A-Box. In that sense, DART can also be seen as a system capable of extracting new prospective inverse relations from text. For example, to find the inverse of the DBpedia property “author”, the domain and range of “author”, namely the classes Person and Book, can be given as inputs to DART, and DART would produce relations both in the forward direction (the same direction as “author”) and in the reverse direction. If any relation in the reverse direction gets grounded to the property “author” (i.e., the direction of the relation is opposite to that of “author” but its meaning is similar), then that relation is a prospective inverse property of “author”.
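The inverse-property idea can be sketched in a few lines. This is an illustration only: the `Relation` record and the function name are hypothetical, the grounding results are invented for the example, and the sketch assumes, purely for illustration, that “author” points from Book to Person.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass(frozen=True)
class Relation:
    phrase: str
    subject_class: str
    object_class: str
    grounded_to: Optional[str] = None  # LOD property the phrase was mapped to, if any

def prospective_inverses(
    relations: List[Relation], prop_name: str, prop_domain: str, prop_range: str
) -> List[str]:
    """Return phrases grounded to prop_name whose direction is the reverse
    of the property (subject in its range, object in its domain)."""
    return [
        relation.phrase
        for relation in relations
        if relation.grounded_to == prop_name
        and relation.subject_class == prop_range
        and relation.object_class == prop_domain
    ]

# Invented relations between the two input classes, in both directions.
relations = [
    Relation("is written by", "Book", "Person", grounded_to="author"),
    Relation("has written", "Person", "Book", grounded_to="author"),
    Relation("has reviewed", "Person", "Book"),
]
candidates = prospective_inverses(relations, "author", prop_domain="Book", prop_range="Person")
```

Here only “has written” qualifies: it is grounded to “author” but runs from Person to Book. Such a candidate would still need to be supported by evidence from the A-Box before being asserted as an inverse property axiom.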

5. Conclusions and Future Work

The central idea of this paper is a completely automated and unsupervised technique for identifying arbitrary relations between two classes of Linked Data. For this purpose, we have built a system, DART, which combines contextual similarity checking and paraphrase detection in a unified framework for discovering new relations from text patterns on the web. DART then attempts to ground the discovered relations in the linked dataset in order to discard irrelevant relations and identify new ones. The fully automated grounding technique proposed in this paper also generates prospective property axioms for the enrichment of the linked dataset.

The results reveal the potential of DART to unearth many interesting relations between a given pair of classes, thus contributing to the growth of a relationship-rich LOD. DART outperforms newOntExt, the state-of-the-art system we compared it with, in terms of both the validity and the number of relations. As part of our future work, we intend to validate the grounding phase in order to improve its accuracy and efficiency. We would also like to propose methods for validating the prospective property axioms generated through DART by gathering evidence from the generated relation instances.

References

  • Aprosio et al. (2013) Alessio Palmero Aprosio, Claudio Giuliano, and Alberto Lavelli. 2013. Extending the Coverage of DBpedia Properties using Distant Supervision over Wikipedia. In NLP-DBPEDIA@ISWC (CEUR Workshop Proceedings), Sebastian Hellmann, Agata Filipowska, Caroline Barriere, Pablo N. Mendes, and Dimitris Kontokostas (Eds.), Vol. 1064. CEUR-WS.org.
  • Assis and Casanova (2014) Pedro H. R. Assis and Marco A. Casanova. 2014. Distant Supervision for Relation Extraction Using Ontology Class Hierarchy-Based Features. In The Semantic Web: ESWC 2014 Satellite Events, Valentina Presutti, Eva Blomqvist, Raphael Troncy, Harald Sack, Ioannis Papadakis, and Anna Tordai (Eds.). Lecture Notes in Computer Science, Vol. 8798. Springer International Publishing, 467–471.
  • Banerjee and Pedersen (2002) Satanjeev Banerjee and Ted Pedersen. 2002. An Adapted Lesk Algorithm for Word Sense Disambiguation Using WordNet. In Proceedings of the Third International Conference on Computational Linguistics and Intelligent Text Processing (CICLing ’02). Springer-Verlag, London, UK, 136–145.
  • Barchi and Hruschka (2014) P. H. Barchi and E. Rafael Hruschka. 2014. Never-ending ontology extension through machine reading. In 2014 14th International Conference on Hybrid Intelligent Systems. 266–272.
  • Barchi and Hruschka (2015) P. H. Barchi and E. Rafael Hruschka. 2015. Two different approaches to Ontology Extension Through Machine Reading. Journal of Network and Innovative Computing 3, 1 (2015), 78–87.
  • Bizer et al. (2009) Christian Bizer, Julius Volz, Georgi Kobilarov, and Martin Gaedke. 2009. Silk - A Link Discovery Framework for the Web of Data. In 18th International World Wide Web Conference.
  • Carlson et al. (2010) Andrew Carlson, Justin Betteridge, Bryan Kisiel, Burr Settles, Estevam R. Hruschka Jr., and Tom M. Mitchell. 2010. Toward an Architecture for Never-Ending Language Learning. In AAAI, Maria Fox and David Poole (Eds.). AAAI Press.
  • Del Corro and Gemulla (2013) Luciano Del Corro and Rainer Gemulla. 2013. ClausIE: Clause-based Open Information Extraction. In Proceedings of the 22nd International Conference on World Wide Web (WWW ’13). 355–366.
  • Etzioni et al. (2011) Oren Etzioni, Anthony Fader, Janara Christensen, Stephen Soderland, and Mausam. 2011. Open Information Extraction: The Second Generation. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence - Volume One (IJCAI’11). AAAI Press, 3–10.
  • Färber et al. (2016) Michael Färber, Achim Rettinger, and Andreas Harth. 2016. Towards Monitoring of Novel Statements in the News. Springer International Publishing, Cham, 285–299.
  • Ferrucci et al. (2010) David Ferrucci, Eric Brown, Jennifer Chu-Carroll, James Fan, David Gondek, Aditya A. Kalyanpur, Adam Lally, J. William Murdock, Eric Nyberg, John Prager, Nico Schlaefer, and Chris Welty. 2010. The AI Behind Watson – The Technical Article. The AI Magazine (2010). http://www.aaai.org/Magazine/Watson/watson.php
  • Fleischhacker et al. (2012) Daniel Fleischhacker, Johanna Völker, and Heiner Stuckenschmidt. 2012. Mining RDF Data for Property Axioms. Springer Berlin Heidelberg, Berlin, Heidelberg.
  • Frakes and Baeza-Yates (1992) William B. Frakes and Ricardo Baeza-Yates (Eds.). 1992. Information Retrieval: Data Structures and Algorithms. Prentice-Hall, Inc., Upper Saddle River, NJ, USA.
  • Gunaratna et al. (2014) Kalpa Gunaratna, Sarasi Lalithsena, and Amit Sheth. 2014. Alignment and dataset identification of linked data in Semantic Web. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 4, 2 (2014), 139–151.
  • Jain et al. (2012) Prateek Jain, Pascal Hitzler, Kunal Verma, Peter Z. Yeh, and Amit P. Sheth. 2012. Moving Beyond SameAs with PLATO: Partonomy Detection for Linked Data. In Proceedings of the 23rd ACM Conference on Hypertext and Social Media (HT ’12). ACM, New York, NY, USA, 33–42.
  • Krause et al. (2012a) Sebastian Krause, Hong Li, Hans Uszkoreit, and Feiyu Xu. 2012a. Large-Scale Learning of Relation-Extraction Rules with Distant Supervision from the Web. In The Semantic Web - ISWC 2012. Lecture Notes in Computer Science, Vol. 7649. Springer Berlin Heidelberg, 263–278.
  • Krause et al. (2012b) Sebastian Krause, Hong Li, Hans Uszkoreit, and Feiyu Xu. 2012b. Large-Scale Learning of Relation-Extraction Rules with Distant Supervision from the Web. In The Semantic Web - ISWC 2012. Lecture Notes in Computer Science, Vol. 7649. Springer Berlin Heidelberg, 263–278.
  • Lehmann et al. (2015) Jens Lehmann, Robert Isele, Max Jakob, Anja Jentzsch, Dimitris Kontokostas, Pablo N. Mendes, Sebastian Hellmann, Mohamed Morsey, Patrick van Kleef, Sören Auer, and Christian Bizer. 2015. DBpedia - A large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web 6 (2015), 167–195.
  • Limaye et al. (2010) Girija Limaye, Sunita Sarawagi, and Soumen Chakrabarti. 2010. Annotating and Searching Web Tables Using Entities, Types and Relationships. Proc. VLDB Endow. 3, 1-2 (Sept. 2010), 1338–1347.
  • Mahdisoltani et al. (2015) Farzaneh Mahdisoltani, Joanna Biega, and Fabian M. Suchanek. 2015. YAGO3: A Knowledge Base from Multilingual Wikipedias. In CIDR 2015, Seventh Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA, January 4-7, 2015, Online Proceedings.
  • Mausam et al. (2012) Mausam, Michael Schmitz, Robert Bart, Stephen Soderland, and Oren Etzioni. 2012. Open Language Learning for Information Extraction. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL ’12). 523–534.
  • Mihalcea et al. (2006) Rada Mihalcea, Courtney Corley, and Carlo Strapparava. 2006. Corpus-based and Knowledge-based Measures of Text Semantic Similarity. In Proceedings of the 21st National Conference on Artificial Intelligence - Volume 1 (AAAI’06). AAAI Press, 775–780.
  • Mikolov et al. (2013) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed Representations of Words and Phrases and their Compositionality. In Advances in Neural Information Processing Systems 26, C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger (Eds.). 3111–3119.
  • Mintz et al. (2009) Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. 2009. Distant Supervision for Relation Extraction Without Labeled Data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2 - Volume 2 (ACL ’09). Association for Computational Linguistics, Stroudsburg, PA, USA, 1003–1011.
  • Mohamed et al. (2011) Thahir P. Mohamed, Estevam R. Hruschka, Jr., and Tom M. Mitchell. 2011. Discovering Relations Between Noun Categories. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP ’11). 1447–1455.
  • Muñoz et al. (2013) Emir Muñoz, Aidan Hogan, and Alessandra Mileo. 2013. Triplifying Wikipedia’s Tables. In LD4IE@ISWC (CEUR Workshop Proceedings), Anna Lisa Gentile, Ziqi Zhang, Claudia d’Amato, and Heiko Paulheim (Eds.), Vol. 1057. CEUR-WS.org.
  • Muñoz et al. (2014) Emir Muñoz, Aidan Hogan, and Alessandra Mileo. 2014. Using Linked Data to Mine RDF from Wikipedia’s Tables. In Proceedings of the 7th ACM International Conference on Web Search and Data Mining (WSDM ’14). ACM, New York, NY, USA, 533–542.
  • Mulwad (2010) Varish Mulwad. 2010. T2LD - An automatic framework for extracting, interpreting and representing tables as Linked Data. Master’s thesis.
  • Mulwad et al. (2010a) Varish Mulwad, Tim Finin, Zareen Syed, and Anupam Joshi. 2010a. T2LD: Interpreting and Representing Tables as Linked Data. In Proceedings of the ISWC 2010 Posters & Demonstrations Track: Collected Abstracts, Shanghai, China, November 9, 2010.
  • Mulwad et al. (2010b) Varish Mulwad, Tim Finin, Zareen Syed, and Anupam Joshi. 2010b. Using Linked Data to Interpret Tables. In Proceedings of the First International Workshop on Consuming Linked Data, Shanghai, China, November 8, 2010.
  • Navarro (2016) Lucas Fonseca Navarro. 2016. Mining Ontologies to Extract Implicit Knowledge. Ph.D. Dissertation. Federal University of Sao Carlos.
  • Nguyen and Moschitti (2011) Truc-Vien T. Nguyen and Alessandro Moschitti. 2011. End-to-end Relation Extraction Using Distant Supervision from External Semantic Repositories. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Short Papers - Volume 2 (HLT ’11). Association for Computational Linguistics, Stroudsburg, PA, USA, 277–282.
  • Paulheim (2017) Heiko Paulheim. 2017. Knowledge graph refinement: A survey of approaches and evaluation methods. Semantic Web 8, 3 (2017), 489–508.
  • Rajaraman and Ullman (2011) Anand Rajaraman and Jeffrey David Ullman. 2011. Mining of Massive Datasets. Cambridge University Press, New York, NY, USA.
  • Ritze et al. (2015) Dominique Ritze, Oliver Lehmberg, and Christian Bizer. 2015. Matching HTML Tables to DBpedia. In Proceedings of the 5th International Conference on Web Intelligence, Mining and Semantics (WIMS ’15). ACM, Article 10, 6 pages.
  • Suchanek et al. (2007) Fabian M. Suchanek, Gjergji Kasneci, and Gerhard Weikum. 2007. Yago: A Core of Semantic Knowledge. In Proceedings of the 16th International Conference on World Wide Web (WWW ’07). 697–706.
  • Suchanek et al. (2009) Fabian M. Suchanek, Mauro Sozio, and Gerhard Weikum. 2009. SOFIE: A Self-organizing Framework for Information Extraction. In Proceedings of the 18th International Conference on World Wide Web (WWW ’09). ACM, New York, NY, USA, 631–640.
  • Syed et al. (2010) Zareen Syed, Tim Finin, Varish Mulwad, and Anupam Joshi. 2010. Exploiting a Web of Semantic Data for Interpreting Tables. In Proceedings of the Second Web Science Conference.
  • Töpper et al. (2012) Gerald Töpper, Magnus Knuth, and Harald Sack. 2012. DBpedia Ontology Enrichment for Inconsistency Detection. In Proceedings of the 8th International Conference on Semantic Systems (I-SEMANTICS ’12). ACM, New York, NY, USA, 33–40.