Linked Open Data Validity -- A Technical Report from ISWS 2018

03/26/2019
by Tayeb Abderrahmani Ghor, et al.

Linked Open Data (LOD) is the publicly available RDF data on the Web. Each LOD entity is identified by a URI and accessible via HTTP. LOD encodes global-scale knowledge potentially available to any human as well as to artificial intelligence that may want to benefit from it as background knowledge for supporting its tasks. LOD has emerged as the backbone of applications in diverse fields such as Natural Language Processing, Information Retrieval, Computer Vision, Speech Recognition, and many more. Nevertheless, regardless of the specific tasks that LOD-based tools aim to address, the reuse of such knowledge may be challenging for diverse reasons, e.g. semantic heterogeneity, provenance, and data quality. As aptly stated by Heath et al., "Linked Data might be outdated, imprecise, or simply wrong": hence the necessity to investigate the problem of Linked Data validity. This work reports a collaborative effort performed by nine teams of students, guided by an equal number of senior researchers, attending the International Semantic Web Research School (ISWS 2018), towards addressing such an investigation from different perspectives, coupled with different approaches to tackle the issue.


2.1 Related Work

[100] offers a survey of the existing literature on locating place names. The authors focus on the positional uncertainty and vagueness frequently associated with place names, and on the differences between common users' perception of places and their representation in gazetteers. In our work, we attempt to address the problem of uncertainty (or validity) of place names extracted from textual documents by exploiting existing knowledge resources – structured Linked Open Data resources.

[27] aims to address the uncertainty of categorical Web data by means of the Beta-Binomial, Dirichlet-Multinomial and Dirichlet Process models. The authors mainly focus on two validity issues: (i) the multi-authoring nature of Web data, and (ii) its time variability. Our work addresses the same Web data validity issues. However, in our approach, we propose to use existing structured linked datasets (i.e., GeoNames, http://www.geonames.org/, and DBpedia, https://wiki.dbpedia.org/) to validate the information – place names – extracted from textual documents.

In [91], a framework called LINDEN is presented to link named entities extracted from textual documents using a knowledge base, YAGO, an open-domain ontology combining Wikipedia and WordNet [94]. To link a given pair of textual named entities (i.e., entities extracted from text), the authors propose to identify equivalent entities in YAGO and then to derive a link between the textual named entities according to the link between the YAGO entities, when it exists. Linking textual named entities to existing Web knowledge resources is a task shared by our approach and that presented in [91]. However, [91] focuses on linking textual named entities, while our work focuses on validating them. Moreover, in [91] the authors exploit one knowledge base (i.e., YAGO), while in our work we use two knowledge bases (i.e., GeoNames and DBpedia).

[36] proposes an automatic approach for georeferencing textual localities identified in a database of animal specimens, using GeoNames, Google Maps and the Global Biodiversity Information Facility. In contrast, our approach takes domain-specific raw text as input, and our goal is not to georeference locations but to validate their identification using GeoNames and DBpedia.

[48] reports on the use of the Edinburgh geoparser for georeferencing digitized historical collections; in particular, the paper describes the work that was undertaken to configure the geoparser for the collections. The extracted data are validated by consulting lists of large places derived from GeoNames and Wikipedia, and decisions are made based on a ranking system. However, the authors do not make any assumptions about whether the data in GeoNames, or in the sources from which they extract information, is valid or not.

2.2 Resources

The structured data can be in the form of an RDF dataset such as DBpedia and GeoNames, and the unstructured data can be any form of natural language text. We have chosen to work with a corpus of historical writings about travel itineraries entitled "Two days we have passed with the ancients… Visions of Italy between XIX and XX century" (Italian Travel Writings Corpus, https://sites.google.com/view/travelwritingsonitaly/). We propose that this dataset provides rich use cases for addressing the textual data validity defined in the Introduction, for four reasons:

  • It contains 30 books that correspond to the accounts written by travelers who are native English speakers traveling in Italy.

  • The corpus consists of the accounts of travelers who visited Italy between 1867 and 1932. These writings share a common genre, namely "travel writing". Therefore, we expect to extract location entities that were valid at the time of travel. However, given that the corpus covers a span of 65 years, it potentially includes cases of contradicting information due to various updates of geographical entities.

  • The corpus might also contain missing or invalid information due to the fact that the travelers included in the dataset are not Italian natives, and therefore we cannot assume that they are experts on the places they visited.

  • The corpus also contains pieces of non-factual data, such as the traveler’s opinions and impressions.

Since the selected corpus concerns geographical data, we selected structured data sources that deal with geographical data. In this project, we utilize GeoNames and DBpedia. GeoNames is a database of geographical names that contains more than 10,000,000 entities. The project was initiated by geographical information retrieval researchers; the core database is provided by official government sources, and users are able to update and improve it by manually editing the contained information. Ambassadors from all continents contribute to the GeoNames dataset with their specific expertise. Thus, we assume that the data included in GeoNames is of sufficient quality. In addition, we select DBpedia as a reliable structured database since it is based on Wikipedia, which provides volunteers with methods to enter new information and to update inconsistent or wrong information. Therefore, we assume that it is a reliable source of information regarding geographical entities. The current version of DBpedia contains around 735,000 places. Information in DBpedia is not updated live, but roughly twice a year; thus, it is not suitable for live information, e.g. an earthquake in a certain location or a sudden political conflict between states. However, since we are working with historical data and not with live events, we consider it valid to include geographical information from DBpedia.

2.3 Proposed Approach

As mentioned in the Introduction section, NLP can be utilized to assess two different issues of validity, textual data validity and Linked Data validity.

Textual data validity

refers to the validity of the information that is extracted from the documents of a given corpus. In our work, we use the named entities obtained by the NLP pipeline to achieve this goal. Our proposed method consists of five steps:

  • Sentence Tokenization: This corresponds to determining sentences from the input corpus.

  • Word Tokenization: This corresponds to determining the words within each sentence identified in the sentence tokenization step.

  • PoS Tagging: This step annotates the tokenized sentences with part-of-speech (PoS) tags.

  • Named Entity Recognition (NER): This step identifies different types of entities based on the output of PoS tagging. In the NLP literature, the recognized entities can either belong to a single class (named entity) or to a set of classes (e.g. person, organization, location). For the textual data validity problem, the choice of a single class or a set of classes depends on the use case.

  • Named Entity Linking (NEL): This step links the named entities obtained by the previous step to the structured datasets. In our method, this corresponds to linking entities to the linked open data sources. Since the underlying assumption is that the structured datasets are reliable, we can conclude that the entities that have been linked are valid entities.

Example 1.

Consider the sentence "For though all over Italy traces of the miracle are apparent, Florence was its very home and still can point to the greatest number of its achievements." The outputs obtained at the end of the steps are provided below; a minimal code sketch of the pipeline follows the example.

  • Word Tokenization: For, though, all, over, Italy, traces, of, the, miracle, are, apparent, Florence, was, its, very, home, and, still, can, point, to, the, greatest, number, of, its, achievements

  • PoS Tagging: (For, IN), (though, IN), (all, DT), (over, IN), (Italy, NNP), (traces, NNS), (of, IN), (the, DT), (miracle, NN), (are, VBP), (apparent, JJ), (Florence, NNP), (was, VBD), (its, PRP$), (very, RB), (home, NN), (and, CC), (still, RB), (can, MD), (point, VB), (to, TO), (the, DT), (greatest, JJS), (number, NN), (of, IN), (its, PRP$), (achievements, NNS)

  • NER: (Italy, location), (Florence, location)

  • NEL: (Bertinoro, location, 2343, bertinoro_URI), (Italy, location, 585, italy_URI)
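
A minimal sketch of the first four steps applied to this sentence, using NLTK (the off-the-shelf library adopted in the experiments reported in Section 2.3.1). The resource names passed to nltk.download and the GPE/LOCATION entity labels are those of NLTK's built-in models and may vary across NLTK versions; the NEL step is sketched separately further below.

import nltk

# One-time download of the required models (resource names may differ across NLTK versions).
for pkg in ["punkt", "averaged_perceptron_tagger", "maxent_ne_chunker", "words"]:
    nltk.download(pkg, quiet=True)

text = ("For though all over Italy traces of the miracle are apparent, "
        "Florence was its very home and still can point to the greatest "
        "number of its achievements.")

locations = []
for sentence in nltk.sent_tokenize(text):          # sentence tokenization
    tokens = nltk.word_tokenize(sentence)          # word tokenization
    tagged = nltk.pos_tag(tokens)                  # PoS tagging
    tree = nltk.ne_chunk(tagged)                   # named entity recognition
    for subtree in tree.subtrees():
        if subtree.label() in ("GPE", "LOCATION"):   # NLTK's labels for location-like entities
            locations.append(" ".join(token for token, _ in subtree.leaves()))

print(locations)  # expected to contain 'Italy' and 'Florence'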

Linked Data validity

refers to the validation of Linked Data using information extracted from trusted textual sources. In order to identify whether a given RDF triple is valid or not, we also propose an approach based on the NLP pipeline. This approach goes deeper into the text, as it also tries to identify relations after the NER step, in order to generate <subject> <predicate> <object> triples. These triples can then be matched against the RDF triples whose validity we aim to assess. If the information is consistent between the input and the extracted relations, we conclude that the RDF triple is valid according to the textual data. Moreover, the proposed method can also be employed to find information that is missing for the entities in the structured dataset. Due to time constraints, this approach is yet to be implemented.

Example 2.

Let us assume that a structured dataset contains the RDF triple (dbr:Istanbul, dbo:populationMetro, 11,174,200). However, we have a recently published document containing the statement "The population of Istanbul is 14,657,434 as of 31.12.2015". The last step of the algorithm should be able to identify the RDF triple (dbr:Istanbul, dbo:populationMetro, 14,657,434). Then, we can conclude that the input RDF triple is not valid.
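
Assuming a relation extraction step has already produced such a triple from the trusted document, a minimal sketch of the final comparison against DBpedia could look as follows; the use of SPARQLWrapper and the numeric comparison are illustrative choices, not the authors' implementation.

from SPARQLWrapper import SPARQLWrapper, JSON

# Triple extracted from the trusted document (subject, predicate, numeric object).
extracted = ("http://dbpedia.org/resource/Istanbul",
             "http://dbpedia.org/ontology/populationMetro",
             14657434)

sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setQuery(f"SELECT ?o WHERE {{ <{extracted[0]}> <{extracted[1]}> ?o }}")
sparql.setReturnFormat(JSON)
bindings = sparql.query().convert()["results"]["bindings"]

# Values currently asserted in the Linked Data source, cast from their lexical forms.
kb_values = {float(b["o"]["value"]) for b in bindings}
if kb_values and float(extracted[2]) not in kb_values:
    print("The RDF triple is not valid with respect to the textual source.")
else:
    print("No disagreement detected.")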

2.3.1 Evaluation and Results: Use case/Proof of concept - Experiments

As explained in the Resources section, a corpus consisting of travel diaries of English-speaking travelers in Italy between 1867 and 1932 was used. Furthermore, DBpedia and GeoNames were selected as the structured databases to link to, since they contain geographical entities. We present our experimental workflow in Figure 2.1.

Figure 2.1: Natural Language Processing workflow

In order to complete tokenization, Part-of-Speech tagging, and NER, we used the Natural Language Toolkit (NLTK) library (https://www.nltk.org/) [20]. NLTK offers an easy-to-use interface and has a built-in classifier for NER. We extracted all named entities belonging to the Person, Location and Organization categories, and then focused only on Location entities. We then used GeoNames and DBpedia for NEL. In order to enhance the matching quality, we used an exact matching method, as sketched below. We used 29 of the 30 documents for our analysis, since one of the documents had a Unicode encoding error.
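
A rough sketch of this exact-match NEL step, using the public GeoNames search API and the DBpedia SPARQL endpoint. The GeoNames username and the restriction to dbo:Place are illustrative assumptions, not the exact configuration used in the experiments.

import requests
from SPARQLWrapper import SPARQLWrapper, JSON

def link_geonames(name, username="demo"):
    # Return the first GeoNames record whose name matches exactly, if any.
    # "demo" is a placeholder account; a registered GeoNames username is required in practice.
    resp = requests.get("http://api.geonames.org/searchJSON",
                        params={"name_equals": name, "maxRows": 1, "username": username})
    hits = resp.json().get("geonames", [])
    return hits[0] if hits else None

def link_dbpedia(name):
    # Return a DBpedia place whose English rdfs:label matches exactly, if any.
    sparql = SPARQLWrapper("https://dbpedia.org/sparql")
    sparql.setQuery("""
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        PREFIX dbo:  <http://dbpedia.org/ontology/>
        SELECT ?s WHERE { ?s rdfs:label "%s"@en ; a dbo:Place } LIMIT 1""" % name)
    sparql.setReturnFormat(JSON)
    bindings = sparql.query().convert()["results"]["bindings"]
    return bindings[0]["s"]["value"] if bindings else None

print(link_geonames("Florence"), link_dbpedia("Florence"))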

In total, we identified 16,037 named location entities in the 29 documents. Linking with GeoNames produced 8,181 linked entities, and linking with DBpedia produced 8,762. We were thus able to validate more than 50% of the entities with at least one of the structured datasets.

For the next step of our analysis, we selected only the linked entities from GeoNames.

Figure 2.2: Top-10 Countries in Linked Location Entities

First, we checked the country information for these entities. Figure 2.2 presents the top-10 countries in which the linked location entities are located. As expected, most of them are located in Italy. This suggests that the GeoNames database has good coverage of geographical entities in Italy. We also found entities from other countries. This might be due to several reasons. First of all, the current name of a location might be different from its name at the time of the author's visit to Italy. Second, some locations might now be part of a different country. Third, there may exist geographical entities with the same name in other countries.

Figure 2.3: Top-10 Location Types in Linked Location Entities

Figure 2.3

presents the top-10 types of the linked entities. As expected, the named entities are generally populated areas and administrative areas. However, the third most frequent location type is hotel. This probably points to problems in the entity linking, since the selected corpus consists of historical travel documents dated between 1867 and 1932. A likely reason for entities being linked to hotels is contemporary hotels bearing historical names. This issue needs to be examined in more detail in future work.

Figure 2.4 displays, for each file, the number of location entities, the number of entities linked using GeoNames and the number of entities linked using DBpedia. The text under each column group corresponds to the title of the document. As can be seen, the number of entity linkings from GeoNames and DBpedia depends strongly on the content of the document. In half of the documents GeoNames performs slightly better than DBpedia, and vice versa. The figure shows that it cannot be clearly stated that one of the selected structured databases works better than the other for the textual data validity of documents regarding geographical entities. However, we have found an example corresponding to a name change of a location in Sicily: its previous name was Monte San Giuliano, and it is now called Erice. When we looked up Monte San Giuliano in GeoNames, we managed to find the contemporary location entity, because GeoNames stores information about old names. However, it was not possible to locate this entity in DBpedia. For this reason, if the entities are extracted from documents containing historical information, it would be better to utilize the GeoNames database.

Figure 2.4: Number of Entities and Entity Linkings from GeoNames and DBpedia

2.4 Discussion and Conclusion

Textual documents are a rich source of knowledge that, due to their unstructured nature, is currently unavailable in the Linked Data cloud. NLP techniques and tools are specifically developed to extract the information encoded in text so that it can be structured and analyzed in a systematic manner. Until now, the opportunities at the intersection of NLP and Linked Data have not received much attention from either the NLP or the Semantic Web community, even though there is unexplored potential for investigation and application to real-world problems.

We proposed an approach to explore this intersection, based on two definitions of validity: textual data validity and Linked Data validity. We selected a textual corpus of travel writings from the 19th and 20th centuries, and applied NLP-based methods to extract location entities. Then, we linked those entities to the structured Linked Data from DBpedia and GeoNames in order to validate the extracted data.

The contributions of this paper include:

  • A definition of Linked Data validity in the context of Natural Language Processing;

  • The combination of two trusted knowledge sources to validate the entities extracted from text;

  • The execution of experiments on a corpus of original travel writings by native English speakers;

  • A proposition of a generic approach which may be easily reproduced in other contexts.

Our approach has the following strengths:

  • We use knowledge from different types of sources (i.e. extracted through NLP and from Linked Data)

  • Our prototype uses off-the-shelf tools, providing an easy entry-point into assessing Linked Data validity from the NLP perspective.

Naturally, there are also some weaknesses to our approach:

  • The assumption that DBpedia and GeoNames are reliable sources for validating the data;

  • The NLP tools are not adapted to the historical travel writing domain and thus may make more mistakes than domain-optimised resources.

In our work we addressed the issues of textual data validity and Linked Data validity. We showed that structuring data extracted from text through NLP is a promising approach to address both issues. Structured data from reliable sources can be used to validate data extracted with NLP, and reliable textual sources can be processed with NLP techniques to serve as a reference knowledge base for validating Linked Data sets.

In this research report, we focused on the first aspect of Linked Data validity from an NLP perspective, namely checking the output of an NLP system against a Linked Data resource. In future work, we will also address the second aspect, namely checking the validity of a Linked Data resource using NLP output extracted from a reliable text source. We will connect to research on trust and provenance on the semantic web, to assess and model trust and reliability.

Furthermore, we plan to extend our experiments by enlarging the dataset, considering more knowledge bases to compare with, and including other domains. We plan to extract more properties, attributes, and historical information about the extracted locations, as such a list of properties might further automate the validation process. Finally, for those entities that are not found in the different knowledge bases, we plan to build an automatic system that adds them, together with the extracted properties. For example, when extracting a piece of historical information, as in the case of the old name of Erice, Monte San Giuliano, we could add this information to the relevant knowledge base, such as DBpedia.

3.1 Related Work

While the notion of context has been extensively discussed in AI [11], there are still no comprehensive studies on the formal representation of context and its application to the Semantic Web. Guha et al. [50] already highlighted the obstacles posed by differences in data context: for example, two datasets may provide their data using the same data model and the same vocabulary. However, subtle differences in data context pose additional challenges for aggregation: these datasets may be related to different topics, or they may have been created at different times or from different points of view.

Information about the context is often not explicitly specified in the available Semantic Web resources, and even when it is, it often does not follow a formally defined representation model, even inside the same resource. A small number of extensions to Semantic Web languages have already been proposed with the aim of handling context [99, 58, 88, 89]:

Both Annotated RDF [99] and the Context Description Framework [58] extend RDF triples with an n-tuple of attributes with partially ordered domains. The additional components can be used to represent the provenance of an RDF triple, or to directly attach other kinds of meta-facts such as context information.

Serafini et al. [88, 89] proposed a different approach, called Contextualized Knowledge Repository (CKR), built on top of the description logic OWL 2 [46]. Contextual information is assigned to contexts in the form of dimensional attributes that specify the boundaries within which the knowledge base is assumed to be true. This context formalization is sufficiently expressive but at the same time more complex. The presented approaches vary widely, and a broad consensus has not yet been reached. Moreover, all of them require extensive work to adapt existing knowledge bases to the proposed new formalism. In contrast, we propose an approach that:

  • makes it easier to extend the existing knowledge resources with context information;

  • allows accessing them while taking the user context into account.

Another important issue is which definition of context is reasonable to use within the Semantic Web: there is, as yet, no universally accepted definition, nor any comprehensive understanding of how to represent context in the area of knowledge base systems. An overview of existing interpretations of context can be found in [56].

3.2 Proposed Approach

Contextual information is important for LOD validity, but existing work requires an adaptation of existing knowledge bases. In the following, we define, among other things, different context dimensions and show how they can be used to describe meta-information about datasets without adapting existing knowledge bases.

Overview

As previously stated in Section 3.1, there is not yet a widely accepted definition of context in the field of Semantic Web. To formulate our definition, we choose to start from a relatively general one, extracted from the American Heritage Dictionary [70]:

1. The part of a text or statement that surrounds a particular word or passage and determines its meaning. 2. The circumstances in which an event occurs; a setting.

The first definition is largely applied in the field of Natural Language Processing when dealing with textual data, while the second one has been already applied in many AI fields, for example Intelligent Information Retrieval [2]. Based on the second definition, we can identify at least three different levels at which it can be applied in RDF:

  1. Dataset Level: This is the external context surrounding an entire dataset. It reflects the circumstances in which the dataset has been created (e.g. information about the source, time of creation, purpose of the dataset, name of the author and much more).

  2. Entity Level: This is the internal context surrounding an entity of a graph. It reflects the circumstances in which the concept represented by the entity “lives” or occurs.

  3. Triple Level: This is the specific context surrounding a single triple in a graph. It reflects the circumstances in which the relation between the subject and the object holds.

The approaches of [58, 50] follow the third level, while the one of [88, 89] is based on the first. As explained in [23], approaches based on the triple level make the knowledge difficult to share, encapsulate and identify. For this reason, in our approach we rely only on the Dataset Level and the Entity Level definitions. The definition of user context we adopt follows the one widely employed in AI systems [11]: the circumstances in which the user queries the knowledge resources (e.g. geo-location, language, interests, purposes, etc.).

Based on the previous definitions of context, we define LOD Validity in the following way:

Given the context of the knowledge resource and the context of the user, the validity of the retrieved data is a function of the similarity between the two contexts.

Dimensions and metrics

Context is not an absolute and independent measure. Several dimensions, with their respective metrics, can have an influence on the context. We have identified three different contextual dimensions: (i) spatio-temporal, (ii) purpose/intention, and (iii) knowledge base population. All three contextual dimensions apply to the knowledge base; the first two additionally apply to the user who is querying the knowledge base (see Figure 3.1). The first, and perhaps the most important, dimension is composed of spatio-temporal contextual factors. Several related metrics can influence the context:

  • Time at a triple level: a fact can become invalid with time, therefore there is a need for time information such as start, end, duration, last update.

  • Time at an entity level: properties and values of an entity can change over time.

  • Time at dataset level: an event that happens after the creation or last update of a dataset cannot be found in that dataset; this is why the creation date and the time of the last update are important pieces of information.

  • Geographic, political and cultural at a dataset level: a political belief, or the native language, or the location can influence the answer one would expect. Is an adult someone who is older than 18 years or 21 years?

The second dimension is the purpose or intention of the dataset. A dataset might be created for a certain purpose that could be modeled with a list of topics. A dataset that does not contain the topics required to answer a query will not be able to provide the expected answer to this specific query. Thus, both user and dataset intentions must match or at least overlap. For example, the President of the U.S.A. can differ between a dataset about politics and a dataset about fictional characters.

The third dimension is the knowledge base population context. What is the provenance of the data? How many sources are there? What are the methods and/or algorithms used to populate the KB? A user may, for example, prefer human-generated data like Wikidata over programmatically generated data like DBpedia. Also, when creating a dataset from a source dataset, approximations or wrong information may be propagated from the source to the new dataset.
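
The report defines validity as a function of the similarity between the dataset context and the user context, without fixing that function. The following is a minimal, purely illustrative sketch of one possible instantiation over the temporal and purpose/intention dimensions; the equal weights and the Jaccard measure are assumptions, not part of the proposal.

from dataclasses import dataclass, field

@dataclass
class Context:
    time_span: tuple                              # (start_year, end_year) the context refers to
    topics: set = field(default_factory=set)      # purpose/intention as a set of topics

def temporal_overlap(dataset_span, user_span):
    # Fraction of the user's time span that is covered by the dataset's time span.
    start = max(dataset_span[0], user_span[0])
    end = min(dataset_span[1], user_span[1])
    return max(0, end - start + 1) / (user_span[1] - user_span[0] + 1)

def similarity(dataset: Context, user: Context) -> float:
    union = dataset.topics | user.topics
    topic_sim = len(dataset.topics & user.topics) / len(union) if union else 1.0
    return 0.5 * temporal_overlap(dataset.time_span, user.time_span) + 0.5 * topic_sim

dataset_ctx = Context(time_span=(2009, 2017), topics={"politics"})
user_ctx = Context(time_span=(2016, 2016), topics={"politics", "usa"})
print(similarity(dataset_ctx, user_ctx))   # higher values suggest the retrieved data is more likely valid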

3.3 Survey of Resources

Generalistic datasets

This category includes datasets such as Wikidata and DBpedia. Even if context metadata are not expressed among their triples, it is well known what the purpose of these datasets is and how they have been generated.

As contextual information (meta-information about the validity of a single atomic piece of information), Wikidata provides property qualifiers (https://www.wikidata.org/wiki/Wikidata:SPARQL_tutorial#Qualifiers, accessed on 06/07/2018) that can declare, among other things, the start and end time of the validity of a statement (e.g. "USA has president Obama"), for example:

?statement1 pq:P580 2009. # qualifier: start time: 2009 (pseudo-syntax)
Domain or application specific datasets

This category includes the less popular datasets that cover a specific domain or application. These datasets often include more descriptive metadata about themselves, frequently following the Dublin Core standard, so that they are easier to parse and to include in other dataset collections (i.e. in the LOD cloud). A survey reveals that the authors of these datasets are in part conscious of the importance of making context explicit, even if with different outcomes.

Table 3.1 presents a brief survey made on the provided datasets. Temporal context is the one most commonly expressed, via properties such as dct:created, dcterms:issued or prov:startedAtTime. The purpose of the dataset is most often contained in the documentation or in a free-text description. Geo-political or method-related contextual metadata are not provided.

A positive example of the context of generation of each entity is provided by ArCo, where a specific MetaInfo object is directly linked to the subject entity being described, specifying the time of generation and the exact source of the information. Again in ArCo, some information is directly represented as time-dependent, such as the location of a cultural object (e.g. http://wit.istc.cnr.it/lodview/resource/TimeIndexedQualifiedLocation/0100200684-alternative-1.html, accessed on 06/07/2018).

3.4 The Provenance Ontology

The need for metadata describing the context of data generation is not new in the LOD environment, and different ways of modelling it have been proposed. One existing solution is the Provenance Vocabulary Core Ontology (http://trdf.sourceforge.net/provenance/ns.html, accessed on 06/07/2018) [52]. Extending the W3C PROV Ontology (commonly known as PROV-O) [72], this vocabulary defines the DataCreation event, to which it is possible to directly link a set of properties that cover our newly introduced contextual dimensions:

  • prov:atLocation (geo-spatial)

  • prov:atTime (time)

  • prv:usedData (kb population, source)

  • prv:performedBy + prov:SoftwareAgent (kb population, methods)

  • prv:performedBy + prv:HumanAgent (kb population, author)

The DataCreation can be linked through prv:createdBy to the dataset or to any entity, giving the possibility of making the context explicit at different granularities. Figure 3.2 shows an example of how to model the DataCreation for a generic dataset. The Provenance Vocabulary (prv:, or prov: for original PROV-O properties) is used for most of the dimensions, while Dublin Core (http://purl.org/dc/elements/1.1/subject, accessed on 06/07/2018) (dc:) is used for the purpose definition.
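
A small sketch of how such a DataCreation description could be produced with rdflib, using the properties listed above. The prv/prov/dc namespace IRIs are the usual ones but should be treated as assumptions, and all ex: resources are purely illustrative.

from rdflib import Graph, Namespace, Literal, URIRef
from rdflib.namespace import RDF, XSD

PRV = Namespace("http://purl.org/net/provenance/ns#")   # Provenance Vocabulary (assumed IRI)
PROV = Namespace("http://www.w3.org/ns/prov#")
DC = Namespace("http://purl.org/dc/elements/1.1/")
EX = Namespace("http://example.org/")                    # hypothetical dataset namespace

g = Graph()
for prefix, ns in [("prv", PRV), ("prov", PROV), ("dc", DC), ("ex", EX)]:
    g.bind(prefix, ns)

creation = EX.footballDatasetCreation
g.add((EX.footballDataset, PRV.createdBy, creation))
g.add((creation, RDF.type, PRV.DataCreation))
g.add((creation, PROV.atLocation, EX.Bertinoro))                          # geo-spatial dimension
g.add((creation, PROV.atTime, Literal("2018-07-06", datatype=XSD.date)))  # temporal dimension
g.add((creation, PRV.usedData, URIRef("https://www.wikidata.org/")))      # KB population: source
g.add((creation, PRV.performedBy, EX.scraperBot))                         # KB population: method
g.add((EX.scraperBot, RDF.type, PROV.SoftwareAgent))
g.add((EX.footballDataset, DC.subject, Literal("football")))              # purpose/intention

print(g.serialize(format="turtle"))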

Time Handling Templates

The goal of the presented time handling templates is to facilitate data usage by giving intelligible hints to the user about how the data can be temporally queried. This removes the need for a time-consuming study of the entire dataset structure on the user side. For each dataset, example SPARQL queries should be provided by the data owner in the form of metadata, so that any data user can quickly write a temporal query.

DBpedia handles duration (or time periods) in several ways (the following list may not be exhaustive):

  • By using an instance of the dbo:TimePeriod class (or one of its subclasses).

    • specific datatype properties might indicate the duration of the considered time period. For example, we can consider a time window of the career of the football player Paul Pogba. dbr:Paul_Pogba__3 is an instance of a subclass of dbo:TimePeriod and has the property dbo:years indicating the year of this period of time.

      Template:

      SELECT *
      WHERE {
        [SUBJECT] a dbo:TimePeriod ;
                  dbo:[DATATYPE_PROPERTY_WITH_TIME_RANGE] ?timeValue
      }
      
    • specifying the considered time period directly in the type. For example, the Julian year 1003 is represented by the resource dbr:1003 whose type is dbo:Year.

  • By using specific (couples of) datatype properties. The type of the time measurement (e.g. year) is specified in the name and more formally in the range. The differentiation between starting and ending events is encoded in the name of the property. For example, dbo:activeYearsStartYear and dbo:activeYearsEndYear or dbo:activeYearsStartDate and dbo:activeYearsEndDate.

    Template:

    SELECT *
    WHERE {
      ?subject dbo:[PropertyName][Start|End][TimeType] ?timeValue
    }
    

    However, since the semantics of these properties is not explicitly provided, their interpretation requires the manual effort of the data creator.

Wikidata, on the other hand, uses the concept of qualifiers to express additional facts and constraints about a triple (using the specific prefixes p, ps and pq as alternative namespaces to distinguish statements and qualifiers from regular properties). For example, the assertion "Crimean Peninsula is a disputed territory since 2014" is expressed through a statement node: a p: property links the subject to the statement, a ps: property links the statement to its value, and a pq: qualifier attaches the temporal information. Wikidata template:

SELECT *
WHERE {
  ?subject p:[PROPERTY_ID] ?statement .
  ?statement pq:[TIME_PROPERTY_ID] ?timeInformation .
}
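
As an illustration, the template can be instantiated against the public Wikidata endpoint for the "USA has president Obama" example mentioned earlier. P580 (start time) is the qualifier cited above; P6 (head of government) and Q30 (USA) are standard Wikidata identifiers assumed here for the sake of the example.

from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://query.wikidata.org/sparql", agent="isws-2018-example/0.1")
sparql.setQuery("""
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX p:  <http://www.wikidata.org/prop/>
PREFIX ps: <http://www.wikidata.org/prop/statement/>
PREFIX pq: <http://www.wikidata.org/prop/qualifier/>
SELECT ?headOfGovernment ?start
WHERE {
  wd:Q30 p:P6 ?statement .                 # each head-of-government statement of the USA
  ?statement ps:P6 ?headOfGovernment ;
             pq:P580 ?start .              # start time qualifier
}
""")
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["headOfGovernment"]["value"], row["start"]["value"])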

3.5 Conclusion

As stated in the introduction, knowledge is created by humans. However, humans have their own beliefs, which might introduce bias. We argue that such beliefs are an important contextual dimension for LOD validity as well. For example, Wikipedia is an online encyclopedia curated by many users and therefore might contain less bias than a dataset created and curated by a single person. However, these beliefs are manifold and often implicit, which makes them hard to express explicitly, both formally and informally. Therefore we did not include a contextual personal belief dimension.

LOD contains contextual information at the dataset level in the form of meta-information, and within the dataset in the form of data. Based on examples, we have shown that contextual information is an important part of LOD validity: for example, data may vary over time at multiple levels, and a user's expectations may depend on her cultural context. In this work, we have provided a set of dimensions that can influence both dataset and user contexts. We have demonstrated the importance, for both the user and the dataset owner, of providing this information in the form of metadata using the Provenance Ontology. We also provide a way to add, as metadata, templates that show users how to use the temporal data in a dataset without a time-consuming study of the data.

We proposed to reuse existing vocabularies to describe contextual meta-information about datasets. Future work can investigate the usage of statistical data (and their semantic representation) about dataset context, to facilitate the selection of a dataset fitting the purpose of the user.

DIMENSION Scholarly Data Data.cnr.it ArCo Pubmed Food Food (subdatasets)
time DATASET ENTITY ENTITY DATASET DATASET
geo-political
purpose/intention UI DATASET
author DATASET DATASET ENTITY DATASET
source DATASET ENTITY ENTITY ENTITY
methods
Table 3.1: A survey of the contextual information in the datasets provided for this report, carried out with the help of a SPARQL endpoint. Green indicates that the information is explicit and machine-readable. Yellow indicates that the information is present only as plain natural language text and has to be further interpreted. Otherwise, no information is provided.
Figure 3.1: The three identified contextual dimensions for LOD Validity. All three are relevant for the dataset and two of them are relevant for the user.
Figure 3.2: The contextual metadata for a Football dataset realised by the authors today in Bertinoro.

4.1 Related Work

Our research work is strictly related to the Linked Data Quality (LDQ) field, because we consider dimensions such as accuracy, completeness, consistency, and novelty in order to compute validity. In the field of LDQ, we identify three different types of contributions: (i) works focused on the definition of quality in LD, (ii) approaches to detect issues and improve quality according to such definitions, and (iii) implementations of tools and platforms based on these approaches. For the first type of contribution, we highlight the work of [104], which discusses many works on data quality assessment through a systematic literature review. For the second type of contribution, focused on the approach, we mention the work of [21], which proposes to apply filters to all available data in order to preserve high-quality information. For the third kind of contribution, related to implementation, we report the work of [59], which presents a tool inspired by test-driven software development techniques to detect quality problems in LOD. In particular, they define tests to detect data quality problems, based on the semi-automated instantiation of a set of predefined patterns expressed in the SPARQL language. Our research can be counted among the works related to approaches developed to identify quality issues, but it focuses on the dimension of validity.

As mentioned in the previous Section, we can also define CQs in order to establish the validity of an ontology (or a KG) for specific tasks. Traditionally, CQs are used for ontology development in specific use cases, gathering functional user requirements [55] and ensuring that all relevant information is encoded. Other works focus on specific methodologies for using CQs. For instance, [29] proposed an approach to transform use case descriptions expressed in a Controlled Natural Language into an ontology expressed in the Web Ontology Language (OWL), allowing the discovery of requirement patterns by formulating queries over OWL datasets. In other cases, CQs consist of a set of questions that an ontology should be able to answer correctly according to a given use case scenario [49]. A wide spectrum of CQs, their usefulness in ontology authoring and their possible integration into authoring tools have been investigated [32, 53, 80]. Unlike such research works, our approach does not focus on the construction of ontologies, but on their validation for the achievement of specific purposes within a well-defined domain.

Finally, for the data preparation stage of the validity evaluation, we can mention works related to link discovery. Such works try to identify semantically equivalent objects in different LOD sources. Most of the existing approaches reduce the link discovery problem to a similarity computation problem, adopting some similarity criteria and the corresponding measures in order to evaluate similarities among resources [75]. The selected criteria could involve both the properties and the semantic context of resources. However, all these approaches focus on finding similarities among LOD sources that belong to the same domain. In contrast, in our project we tried to discover similarities between general and domain-specific LOD knowledge bases. Other techniques based on entity linking, like DBpedia Spotlight [69] and TellMeFirst [83], can be exploited for link discovery starting from natural language descriptions of the entities.

4.2 Resources

As mentioned in the first Section, our approach requires at least one Ground-Truth Knowledge Graph (GT-KG) that plays the role of oracle in our evaluation and a Test-Set Knowledge Graph (TS-KG) that should be evaluated against the GT-KG. Several KGs have been proposed in the literature, many of them specialized in a particular domain, while general KGs commonly focus on real-world entities and their relations.

Expert KGs focus on a specific domain and contain deep and detailed information about a particular area of knowledge. Among these we can highlight DRUGS, a KG that includes valuable information about drugs from a bioinformatics and cheminformatics point of view, and BIO2RDF, another expert KG that provides data for the Life Sciences. As we decided to focus our attention on the Cultural Heritage field, we chose ArCo as Ground-Truth Knowledge Graph. ArCo is a recent project, started in November 2017 by the Istituto Centrale per il Catalogo e la Documentazione (ICCD) and the Istituto di Scienze e Tecnologie della Cognizione (ISTC). Its aim is to enhance the value of Italian cultural heritage by creating a network of ontologies that model the knowledge of the cultural heritage domain. From the modelling point of view, ArCo tries to apply good practices concerning both ontology engineering and the fulfillment of user requirements.

In particular, ArCo is a project oriented towards the reuse and alignment of existing ontologies through the adoption of ontology design patterns. Moreover, following an incremental development approach, it tries to fulfil at every stage the user requirements provided by a group of early adopters. Examples of early adopters are a firm, a public institution or a citizen. They contribute to the development of the project by testing the preliminary versions of the system and providing real use cases to the team of developers.

On the other hand, one of the most popular general KGs is DBpedia [21], which is automatically created from the Wikipedia editions, considering only the title, the abstract and the semi-structured information of each page (e.g., infobox fields, categories, page links, etc.). In this way, the quality of the DBpedia data depends directly on the Wikipedia data; this matters because Wikipedia is a large and valuable source of entities, but its quality is questionable since anyone can contribute. This problem also affects the cross-language information of DBpedia. For instance, the pages about Bologna in the English and Italian versions of DBpedia do not provide equivalent information.

In order to homogenize the description of the information in DBpedia, the community has devoted efforts to developing an ontology schema, which gathers specific information such as the properties of the Wikipedia infoboxes. This ontology was manually created and currently consists of 685 classes, which form a subsumption hierarchy and are described by 2,795 different properties. With this schema, the DBpedia ontology contains 4,233,000 instances, among which those belonging to the Person (1,450,000 instances) and Place (735,000 instances) classes predominate.

4.3 Proposed Approach

Our approach is based on the general LDQ assessment pipeline presented by [85]. This methodology comprises four different stages: (i) Preparing the Input Data, (ii) Requirement Validation, (iii) Linked Data Validation Analysis, and (iv) Linked Data Improvement. Figure 4.1 shows the pipeline of the methodology for the validation of Linked Data. The following sections describe the phases of our methodology. Stages (i) and (iv) are described from a high-level point of view, because our contribution focuses on stages (ii) and (iii).

Figure 4.1: Pipeline of the Methodology proposed for Linked Data Validity
Stage i - Preparing the Input Data

After choosing the GT-KG and the TS-KG, respectively ArCo and DBpedia in our specific case, we build a bridge between the two KGs by exploiting ontology matching and entity alignment techniques, using both manual and automatic tools to accomplish this task (see the Related Work section on link discovery for more details). In this way, we create the conditions to compare a set of statements, i.e. a subgraph, for the validation process. As we report in the Use Case section, we start from the relevant classes, properties, and entities linked in this stage.

Stage ii - Requirement Validation

In our approach, we have defined data quality dimensions for the internal perspective as accuracy, completeness, consistency, and novelty. The dimensions are based on the quality assessment for linked data presented by Zaveri et al. [104].

Accuracy relates to the degree to which one or more statements reported in the GT-KG are correctly represented in the TS-KG. The metrics identified for the validation of LD statements are the detection of inaccurate values, annotations, labellings and classifications by comparison with the ground-truth dataset.

The completeness validation of an entity in the TS-KG corresponds to the degree to which the information contained in the GT-KG is present in the TS-KG. This can be done by looking at specific statements and the mapped properties. Additionally, besides mapped properties, Linked Data patterns of one KG mapped to LD patterns of the other KG can also be analysed to check the completeness dimension.

Consistency validation means that the Linked Data statements should be free of contradictions w.r.t. defined constraints. Consistency can be assessed at the schema and data levels: consistency at the schema level indicates that the schema of a dataset should be free of contradictions, while consistency at the data level relies on the absence of inconsistencies in the ABox in combination with its corresponding TBox.

The novelty of Linked Data is defined as the set of relevant Linked Data statements that are in the dataset and that are not represented in the ground truth dataset. These Linked Data statements correspond to new predictions that should be validated.

Stage iii - Linked Data Validation Analysis

The goal of this phase is to perform the validation by specifying metrics that correspond to the four dimensions defined in the previous stage.

The accuracy degree of a group of LD statements can be determined by computing precision, recall and F1 score over the numbers of validated and non-validated statements.

The completeness degree can be computed as the ratio between the number of GT-KG statements about an entity that are also represented in the TS-KG and the total number of GT-KG statements about that entity.

The metric used for the computation of consistency is the number of inconsistent statements detected in the knowledge graph.

The novelty can be computed as the number of TS-KG statements about an entity that are not represented in the GT-KG.
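
The report describes these metrics only verbally. The following is a minimal sketch, not the authors' implementation, of how accuracy, completeness and novelty could be computed once the GT-KG and TS-KG statements about a matched entity are represented as sets of (property, value) pairs; consistency is left to the reasoning-based check illustrated in the next Section.

def accuracy(gt, ts):
    # Precision, recall and F1 of the TS-KG statements validated against the GT-KG.
    validated = ts & gt
    precision = len(validated) / len(ts) if ts else 0.0
    recall = len(validated) / len(gt) if gt else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def completeness(gt, ts):
    # Fraction of ground-truth statements that are also represented in the TS-KG.
    return len(gt & ts) / len(gt) if gt else 1.0

def novelty(gt, ts):
    # TS-KG statements absent from the GT-KG: candidate new facts to be validated.
    return ts - gt

# Hypothetical, simplified statements about the matched Colosseum entity.
gt_kg = {("description", "Flavian amphitheatre"), ("lat", "41.89"), ("long", "12.49")}
ts_kg = {("description", "Flavian amphitheatre"), ("lat", "41.89"), ("architecturalStyle", "Roman")}
print(accuracy(gt_kg, ts_kg), completeness(gt_kg, ts_kg), novelty(gt_kg, ts_kg))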

Stage iv - Linked Data Improvement

In this stage, strategies to address the problems with the invalidated statements are implemented. One possible strategy is the implementation of an automatic or semi-automatic system that provides recommendations for the invalid LD statements.

4.4 Evaluation and Results: Use case/Proof of concept - Experiments

In this Section we present a use case that exploits ArCo as GT-KG and DBpedia as TS-KG. During the "Preparing the Input Data" stage mentioned in the previous Section, we perform the ontology matching and similar-entity linking between ArCo and DBpedia. In this way we are able to obtain an entity matching between instances of ArCo and instances of DBpedia. For instance, we are able to state that the entity identified in ArCo as Colosseum (http://dati.beniculturali.it/lodview/mibact/luoghi/resource/CulturalInstituteOrSite/20734) is probably the same entity as the Colosseum in DBpedia (http://it.dbpedia.org/resource/Colosseo). Each ArCo instance can be related to multiple DBpedia instances and each DBpedia instance can be related to multiple ArCo instances. We assume that these links are stored in a separate graph.

We start by identifying the most common properties of ArCo classes. As mentioned in the previous paragraph, in our case we focus on the ArCo class cis:CulturalInstituteOrSite (http://dati.beniculturali.it/cis/CulturalInstituteOrSite), counting the most common properties with the following SPARQL query.

SELECT DISTINCT ?class ?p (COUNT(?p) AS ?numberOfProperties)
WHERE {
  ?class a owl:Class .
  ?inst a ?class ;
        ?p ?o .
  # classes cannot be blank nodes + no owl:Thing and owl:Nothing
  FILTER (?class != owl:Nothing)
  FILTER (?class != owl:Thing)
  FILTER (!isBlank(?class))
}
GROUP BY ?class ?p
ORDER BY ?class

According to the results obtained through this query, we have chosen the properties and values reported in Table 1 to compute accuracy, completeness, and novelty. The results of this query can also be used as a weighting factor for the different validity measures related to properties. The table shows an example of the LD validation of several relevant properties of the real-world entity Colosseum.

Figure 4.2:

Determining the consistency of matched entities using owl:sameAs, in combination with an ontology alignment of both the TS-KG and the GT-KG (including restrictions), can be described with the following example:

@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix cis:  <http://dati.beniculturali.it/cis/> .
@prefix core: <https://w3id.org/arco/core/> .
@prefix arco: <http://dati.beniculturali.it/mibact/luoghi/resource/CulturalInstituteOrSite/> .
@prefix dbo:  <http://dbpedia.org/ontology/> .
@prefix yag:  <http://dbpedia.org/class/yago/> .
@prefix dbr:  <http://dbpedia.org/resource/> .

#Arco Tbox
cis:CulturalInstituteOrSite a owl:Class .
core:AgentRole a owl:Class .
cis:CulturalInstituteOrSite owl:disjointWith core:AgentRole .

#Arco Abox
arco:20734 a cis:CulturalInstituteOrSite .

#DBpedia Tbox
dbo:Venue a owl:Class .
yag:YagoLegalActorGeo a owl:Class .

#DBpedia Abox
dbr:Colosseum a dbo:Venue , yag:YagoLegalActorGeo .

#Arco-DBpedia ontology mapping
core:AgentRole owl:equivalentClass yag:YagoLegalActorGeo .

#Arco-DBpedia entity linking
arco:20734 owl:sameAs dbr:Colosseum .

If these graphs are analysed by a reasoning engine, it will come across an inconsistency, as the owl:disjointWith restriction is violated. Debugging systems and their heuristic methods can be used by a machine to determine which triples might be causing the inconsistency. In the above case, there are three triples that could be considered, relating to the ontology mapping, the entity linking, or a wrongly asserted triple in the TS-KG ABox:

#Arco-DBpedia ontology mapping
core:AgentRole owl:equivalentClass yag:YagoLegalActorGeo .
#Arco-DBpedia entity linking
arco:20734 owl:sameAs dbr:Colosseum .
#DBpedia Abox
dbr:Colosseum a yag:YagoLegalActorGeo .

For a human interpreter, it is quite obvious that the inconsistency is caused by the wrongly asserted triple in the TS-KG, but machines cannot easily deal with it. We assume there are ten statements on the entity in the TS-KG.
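
A minimal sketch of how a machine could detect this violated owl:disjointWith restriction with rdflib, using a SPARQL property path that follows owl:sameAs and owl:equivalentClass links in both directions; this is a lightweight illustration, not a substitute for a full OWL reasoner and its debugging heuristics.

from rdflib import Graph

# The relevant subset of the triples from the example above.
ttl = """
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix cis:  <http://dati.beniculturali.it/cis/> .
@prefix core: <https://w3id.org/arco/core/> .
@prefix arco: <http://dati.beniculturali.it/mibact/luoghi/resource/CulturalInstituteOrSite/> .
@prefix yag:  <http://dbpedia.org/class/yago/> .
@prefix dbr:  <http://dbpedia.org/resource/> .

cis:CulturalInstituteOrSite owl:disjointWith core:AgentRole .
arco:20734 a cis:CulturalInstituteOrSite ;
           owl:sameAs dbr:Colosseum .
dbr:Colosseum a yag:YagoLegalActorGeo .
core:AgentRole owl:equivalentClass yag:YagoLegalActorGeo .
"""

g = Graph()
g.parse(data=ttl, format="turtle")

# An individual typed with a class that is disjoint with an equivalent of a class
# of a sameAs-linked individual violates the disjointness restriction.
ask = """
PREFIX owl: <http://www.w3.org/2002/07/owl#>
ASK {
  ?c1 owl:disjointWith ?c2 .
  ?x a ?c1 .
  ?x (owl:sameAs|^owl:sameAs)* ?y .
  ?y a ?d .
  ?d (owl:equivalentClass|^owl:equivalentClass)* ?c2 .
}
"""
print(g.query(ask).askAnswer)   # True: the restriction is violated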

The final computation of the metrics corresponding to the four dimensions of Linked Data Validity is as follows:

4.5 Conclusion and Discussion

The paper presents an approach to establish the validity of Linked Data according to an internal perspective, considering specific dimensions from the data quality domain, in particular: accuracy, completeness, consistency and novelty. In some cases, such as cis:Description in ArCo and ontology:Abstract in DBpedia, we directly compare the two statements according to such dimensions.

In other cases we also focus on ontology patterns. Consider a simple example of geographic information: in the ArCo ontology we find properties like geo:lat and geo:long associated with a specific entity such as the Colosseum, while in DBpedia we may have a Point concept, specifying latitude and longitude, associated with the Colosseum entity. Therefore, to compare such data, we can exploit this kind of pattern.

In some cases we have considered statements valid according to the internal perspective rather than the external one. For instance, we have noticed that the geolocation information about the Colosseum is slightly different in ArCo and in DBpedia. Such statements can be considered valid for placing a point on a map, but we may need more accuracy if a robot has to perform a job in that area. For this specific case we can define a CQ that establishes validity for that specific purpose.

Finally, regarding the novelty dimension, we roughly state that a statement is novel (and valid) if it appears in DBpedia and not in ArCo. Nevertheless, as future work, we should perform a much deeper analysis of such novel statements in order to establish their validity.

5.1 Related Work

Domain-specific subgraph relevancy

A variety of domain-specific subgraph extraction works have addressed the issue of validity in terms of relevancy. These methods usually employ the relatedness of the associated concepts to the domain of interest [61]. The work by Lalithsena et al. [62] considers that the relevancy of a concept to a particular domain can be determined through the type and lexical semantics of the category label associated with that concept. Furthermore, Perozzi et al. [78] proposed graph clustering with user preference, i.e. finding a subgraph with regard to the user's interest. As opposed to that work, where relevant nodes are determined using the Euclidean distance between nodes, we propose an approach that identifies the most relevant subgraph by combining the spatial and contextual semantics of nodes at the same time. Our proposed contextual similarity (via topic modeling), augmented with a KG-embedding-based approach, contributes to identifying the nature of predicates, i.e. whether they are more responsive to cross-domain or intra-domain relations.

Knowledge Validity

One of the prominent works on automatic KG construction and prediction of the correctness of facts is by Dong et al. [33]. In that work, instead of focusing on text-based extraction, they combined extractions from Web content with prior knowledge. Bhatia et al. [18] also designed an approach to complement the validity of facts in automatic KG curation by taking into consideration descriptive explanations about these facts. Bhatia and Vishwakarma [19] have shown the significance of context in studying the entities of interest when searching huge KGs. However, we propose to extend the context by complementing the spatial neighborhood of entities with the context of the predicates (edges) connecting these entities.

Topic modeling

In this report, we also apply topic modeling [102]. In this task, given a dataset of documents, where each document is a text, we try to obtain the set of topics that are present among the documents. The most important step is the grouping of the documents, where a document can be present in more than one group; each group corresponds to a topic. Then, in order to map a topic to a label (e.g. sport, health, politics), we look at the frequent words among the documents in that group. One basic way to perform this grouping is by applying Latent Dirichlet Allocation (LDA) [22], which allocates documents to different topics.

Knowledge graph embedding

The purpose of knowledge graph embedding is to embed a KG into a low-dimensional space while preserving some of its properties. This allows graph algorithms to be computed efficiently. Yao [103] proposes a knowledge graph embedding algorithm to achieve topic modeling. However, that work does not take into account property values, which contain an essential part of the knowledge. Numerous KG embedding techniques have been proposed [25]. In this report, we focus on node embedding algorithms that preserve node position in the graph, and thus graph topology, such as Laplacian eigenmaps [15], random walks [39], DeepWalk [78] and Node2Vec [47].

5.2 Resources

In our approach, we focus on identifying a domain-specific subgraph, given a generic graph. Thus, in general, we can select any generic graph as input. However, since our approach heavily relies on descriptions of entities for the topic modeling, we need a KG that provides descriptions of its entities. Looking at the widely studied generic KGs, we see that DBpedia provides long abstracts. In addition, most of the reviewed approaches use this graph for experimental evaluation, so this choice also enables comparative experimentation.

We are mostly interested in computing the relevance of properties that allow us to enrich an existing dataset with external information. Thus, considering DBpedia, we can identify two kinds of properties:

Hierarchical Predicates

are used for structuring the knowledge and include predicates indicating broader concepts, subclass relations, disjointness, etc. Often, these predicates are only used on more abstract entities. To determine the relevance of entities connected by these predicates, it is crucial to investigate the entities themselves.

Non-hierarchical Predicates

are typically more context-specific and could include predicates like directedBy, writer, actedIn, etc. For these predicates, it is usually not necessary to scrutinize each entity separately. Rather, once it is established that the predicate is relevant for the domain, all nodes connected by it are considered relevant as well.

Figure 5.2: Non-hierarchical predicates (top) vs hierarchical predicates (bottom)

In the cinematography use case, an example of predicates related to this domain is shown in Figure 5.2 (example by Lalithsena et al. [62]).

5.3 Proposed Concept

We are interested in enriching an existing KG (which we assume to be valid) with information represented in DBpedia. With reference to our definition of validity (the Topic), we want to find properties within the DBpedia KG that are relevant for our specific domain. For example, if our existing KG represents scientists, we are probably interested in properties such as dbo:doctoralAdvisor or dbo:almaMater.

Figure 5.3: Architecture describing the overall pipeline

In this report, we start the investigation of a new approach to find relevant information with reference to a given context, based on topic modeling and graph embedding. Figure 5.3 depicts the pipeline. The first step of the pipeline is to find the topics represented by the KG. To find the properties related to the domain, we instantiate a typical topic modeling task as follows:

  • We select the set P of all properties in the graph.

  • For each property p in P, we collect the set E_p of all entities that appear as object of p.

  • Given p and E_p, we create the document d_p containing the concatenation of the abstracts (i.e., the textual descriptions) of all entities in E_p (a sketch of this document-construction step is given after this list).

  • We run a topic modeling task over all documents d_p. The number of clusters is set manually.

    • As a result, we obtain a matrix in which each row corresponds to a property p, while each column represents a topic. These topics can be labeled manually by looking at the words contained in each cluster; e.g., a cluster containing the words city, lake, neighborhood, and capital could be labeled as Location.

    • A cell value represents the probability that property p belongs to the corresponding topic.

  • Based on the above matrix, we can fetch relevant information from DBpedia by selecting the properties whose probability is higher than a set threshold.

  • Note that this pipeline does not give a clear indication of the relevance of the values (i.e., objects) of these properties, even if we are able to fetch the correct information (because we know the right property).
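The document-construction step referenced above can be sketched as follows, assuming the public DBpedia SPARQL endpoint; the property list, the LIMIT and the query shape are illustrative choices, not the exact queries used in this report.

```python
# Sketch of building one document d_p per property p from DBpedia abstracts.
from SPARQLWrapper import SPARQLWrapper, JSON

ENDPOINT = "https://dbpedia.org/sparql"

def document_for_property(prop_uri, limit=200):
    """Concatenate the English abstracts of entities appearing as objects of prop_uri."""
    sparql = SPARQLWrapper(ENDPOINT)
    sparql.setQuery(f"""
        SELECT ?abstract WHERE {{
            ?s <{prop_uri}> ?o .
            ?o <http://dbpedia.org/ontology/abstract> ?abstract .
            FILTER (lang(?abstract) = "en")
        }} LIMIT {limit}
    """)
    sparql.setReturnFormat(JSON)
    rows = sparql.query().convert()["results"]["bindings"]
    return " ".join(r["abstract"]["value"] for r in rows)

properties = [
    "http://dbpedia.org/ontology/director",
    "http://dbpedia.org/ontology/doctoralAdvisor",
]
corpus = [document_for_property(p) for p in properties]
# corpus can now be fed to the LDA step shown earlier; thresholding the resulting
# property-topic matrix selects the properties relevant to each topic.
```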

After the topic modeling step, the set of properties most closely related to each topic can be identified. Then we can find the objects of each of these properties. This results in a collection L_t of objects/entities that is strongly oriented towards the chosen topic/cluster t.

Next, we use graph embeddings to further narrow down the domain-oriented list L_t and create a more cohesive network, based on the spatial topology of the nodes in the graph (since the contextuality has already been taken care of). We do this by representing nodes as vectors in a space using a graph embedding algorithm that preserves the topological structure of the graph (e.g. DeepWalk). Then, we look up the nodes of L_t in the embedding space and compute outliers. Once the outliers are identified, we remove these isolated objects from L_t and recreate a graph G_t with the remaining objects.
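A minimal sketch of this narrowing-down step, assuming a DeepWalk-style embedding (random walks fed to word2vec) and a simple distance-to-centroid outlier criterion; the graph, the list L_t and all thresholds are placeholders.

```python
# DeepWalk-style sketch: random walks + word2vec, then pruning outlier nodes from L_t.
import random
import networkx as nx
import numpy as np
from gensim.models import Word2Vec

def random_walks(G, num_walks=10, walk_length=20):
    walks = []
    for _ in range(num_walks):
        for node in G.nodes():
            walk = [node]
            while len(walk) < walk_length:
                neighbors = list(G.neighbors(walk[-1]))
                if not neighbors:
                    break
                walk.append(random.choice(neighbors))
            walks.append([str(n) for n in walk])
    return walks

G = nx.karate_club_graph()                     # placeholder for the extracted DBpedia subgraph
model = Word2Vec(random_walks(G), vector_size=64, window=5, min_count=1, sg=1)

L_t = [str(n) for n in range(10)]              # placeholder domain-oriented node list
vectors = np.array([model.wv[n] for n in L_t])
centroid = vectors.mean(axis=0)
dist = np.linalg.norm(vectors - centroid, axis=1)
threshold = dist.mean() + 2 * dist.std()       # simple outlier criterion
kept = [n for n, d in zip(L_t, dist) if d <= threshold]
```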

Now, for each subgraph G_t, we analyze the properties of each object. For every property of every object in the subgraph, we analyze how often the property has a property path leading to nodes that are not part of this subgraph. We then normalize this score to the range [0, 1], which gives an indication of how often a predicate takes us out of the domain.
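A small sketch of this out-of-domain score, computed here over toy triples and a toy subgraph; in practice the triples would come from the extracted topic subgraph.

```python
# Out-of-domain tendency: for each property, the fraction of its triples whose object
# lies outside the topic subgraph (triples and subgraph below are illustrative).
from collections import defaultdict

subgraph_nodes = {"ex:Film1", "ex:Director1"}            # nodes of the topic subgraph G_t
triples = [
    ("ex:Film1", "dbo:director", "ex:Director1"),
    ("ex:Film1", "dbo:location", "ex:City1"),            # leaves the domain
]

total = defaultdict(int)
outside = defaultdict(int)
for s, p, o in triples:
    total[p] += 1
    if o not in subgraph_nodes:
        outside[p] += 1

# Normalized score in [0, 1]: how often the property takes us out of the domain.
score = {p: outside[p] / total[p] for p in total}
print(score)   # {'dbo:director': 0.0, 'dbo:location': 1.0}
```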

This process is repeated for all topics generated during the topic modeling phase. In the end we can determine whether the behavior of properties has a pattern throughout different topics. This can help us in determining if there are certain properties that have a tendency to take us out of the domain while some may not.

This analysis can help us be selective when expanding the semantics of the data in a given scenario: we can choose the properties used for expansion depending on whether the scenario requires more cross-domain knowledge or whether the scope of the semantics should remain within the domain.

5.4 Proof of concept and Evaluation Framework

In this section we describe a methodology to test the proposed approach. Possible metrics for evaluating different aspects of our work can be grouped as follows:

Graph reduction

these metrics give an indication of the capability of our approach to reduce a generic graph to a smaller specific-domain subgraph.

Impact on accuracy and recall

these metrics demonstrate the performance of our approach in terms of accuracy (i.e., relevance of the retrieved entities, non-relevant retrieved entities, missed entities, etc.)

Impact on run-time

we have to measure how much time can be saved by using the proposed approach, instead of running ad-hoc queries in order to retrieve entities related to manually selected properties.

Application based evaluation

in the end, the data collected by the approach would be used as part of another application (e.g., a recommender system). An investigation would measure how well our approach eases the enrichment phase in different application domains.

To get an impression of the feasibility of what we propose, we have already performed some initial experiments. We performed an n-hop expansion of hierarchical categories in DBpedia: we traversed the DBpedia categories connected by the skos:broader relation, starting from the root node of four topics (Databases, Datamining, Machine_Learning, and Information_Retrieval). Table 5.1 shows our results using the n-hop expansion technique.

Root category Number of hops Number of subcategories extracted
Databases 8 880
Datamining 8 15
Machine Learning 8 2193
Information Retrieval 8 8557
Table 5.1: Analysis of different topics subgraph sizes with the same number of hops traversed.

It is evident that for the same number of hops (8 in this case) we obtained widely varying numbers of subcategories using the n-hop expansion technique. Our approach is supposed to automatically extract the most relevant subgraph irrespective of the number of hops traversed. We also present an initial analysis of the effect of the number of hops traversed on the number of subcategories extracted for a particular domain, Film, in Table 5.2 below.

Number of hops Number of subcategories extracted
20 1048799
10 220311
5 25425
Table 5.2: Analysis of the number of hops expansion for a particular domain.

Table 5.2 shows that, given a particular domain of interest, n-hop expansion subgraph extraction can yield subcategory sets of very different sizes. The selection of the most relevant subgraph then depends on manual choices and on the performance of the resulting graph in the intended applications. We therefore propose to evaluate our automatic topic-driven approach against the most relevant n-hop expansion subgraph.
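For reference, the n-hop expansion behind Tables 5.1 and 5.2 can be reproduced along the following lines, assuming the public DBpedia endpoint and an iterative one-hop query per frontier category; the root category and the number of hops are illustrative.

```python
# Breadth-first n-hop expansion of skos:broader links from a root DBpedia category.
from SPARQLWrapper import SPARQLWrapper, JSON

ENDPOINT = "https://dbpedia.org/sparql"

def subcategories(category_uri):
    """Direct subcategories of a category (narrower categories point to it via skos:broader)."""
    sparql = SPARQLWrapper(ENDPOINT)
    sparql.setQuery(f"""
        SELECT DISTINCT ?sub WHERE {{
            ?sub <http://www.w3.org/2004/02/skos/core#broader> <{category_uri}> .
        }}
    """)
    sparql.setReturnFormat(JSON)
    rows = sparql.query().convert()["results"]["bindings"]
    return {r["sub"]["value"] for r in rows}

def n_hop_expansion(root, hops):
    visited, frontier = set(), {root}
    for _ in range(hops):
        next_frontier = set()
        for cat in frontier:
            next_frontier |= subcategories(cat) - visited
        visited |= next_frontier
        frontier = next_frontier
    return visited

cats = n_hop_expansion("http://dbpedia.org/resource/Category:Databases", hops=3)
print(len(cats))
```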

The evaluation covers precision, recall and execution time, together with a comparison against topic modelling approaches and knowledge graph embedding approaches. As they are generally available and widely used in research, we suggest evaluating our approach on DBpedia and Wikidata. To conduct this evaluation, one would:

  • Select multiple topics

  • Manually extract specific-domain subgraphs from DBpedia and Wikidata (which are then used as a gold standard).

  • For each topic, generate the specific topic subgraph from DBpedia and Wikidata using our approach and state-of-the-art knowledge graph embedding and topic modelling approaches.

  • Measure the execution time.

  • For each execution and identified subgraphs, compute precision and recall.

  • Compare this approach with results of others.

We predict that our approach would be able to obtain a higher precision and recall, but worse execution time than the state of the art.

5.5 Conclusion and Discussion

In this report, we analyzed the concept of Linked Data validity from a specific perspective, namely the problem of enriching a domain-specific subgraph from generic KGs, considering the relevance of a property or an entity to the domain. We then suggested an approach based on topic modeling and knowledge graph embedding, and designed a preliminary experiment in order to evaluate the proposed approach.

Our approach is not tied to a specific topic and can be applied to any other topic, which is interesting since a generic KG can contain many topics. Moreover, the topic subgraph is obtained both spatially and contextually. We have also analyzed the tendency of a property to take us out of the domain (i.e., to have paths to entities outside of it).

As future work, a complete evaluation of this method is needed; moreover, more sophisticated methods could be applied.

6.1 Related Work

The aim of this work is to design a systematic approach to investigate patterns in LOD and to validate the facts in LOD from the perspective of commonsense knowledge, in the context of human actions. A few recent studies focus on semantically enriching knowledge graphs with commonsense knowledge. Common sense is elicited either from language features and structures or from notions formalised in foundational ontologies through semantic alignment. Recent approaches apply machine learning, and very recent works apply deep learning to infer commonsense knowledge from large text corpora.

Classification-Based Approaches.

Asprino et al. [12] focus on the assessment of foundational distinctions over LOD entities, hypothesizing that they can be validated against common sense. They aim at distinguishing and formally asserting whether an LOD entity refers to a class or an individual, and whether the entity is a physical object or not, foundational notions that are assumed to match common sense. They design and execute a set of experiments to extract these foundational notions from LOD, comparing two approaches. They first transform the problem into a supervised classification problem, exploiting entity features extracted from the DBpedia knowledge base, namely: the entity abstract, its URI, and the incoming and outgoing entity properties. Then, the authors compare this method with an unsupervised alignment-based classification that exploits the alignments between DBpedia entities and WordNet, Wiktionary and OmegaWiki, linked data encoding lexical and linguistic knowledge. The authors run a final experiment to validate the results against common sense, using crowdsourcing and expert-based evaluation. Our contribution is inspired by this prior work, and we intend to extend it by designing a classification process for actions related to human beings according to common sense.

Alignment-Based Approaches.

Other works exploit foundational ontology-based semantic annotation of lexical resources that can be used to support commonsense reasoning, i.e. to make inferences based on commonsense knowledge. Gangemi et al. [42] made a first attempt to align WordNet upper-level synsets with the foundational ontology DOLCE; Silva et al. [92] extended this alignment to verbs in order to also support commonsense reasoning on events, actions, states and processes.

Deep-Learning-Based Approaches.

Other works assume that contextual commonsense knowledge is captured by language and try to infer it from a portion of the discourse or from text corpora for question answering. Recently, neural language models trained on large text corpora have been applied to improve natural language applications, suggesting that these models may be able to learn commonsense information. Lars Kunze et al. [60] presented a system that converts commonsense knowledge from the large Open Mind Indoor Common Sense database from natural language into a Description Logic representation, which allows for automated reasoning and for relating it to other sources of knowledge. Additionally, Trinh and Le [98] focus on commonsense reasoning based on deep learning. The authors use an array of large RNN language models that operate at word or character level on LM-1-Billion, CommonCrawl, SQuAD, Gutenberg Books, and a corpus customized for the task, and show that the diversity of the training data plays an important role in test performance. Their method skips the usage of annotated knowledge bases. In this work, however, our aim is the validation of LOD facts from the commonsense perspective; this line of work could be extended in order to identify the actions.

Commonsense Knowledge Bases.

Another notable work in the commonsense knowledge domain is ConceptNet, a crowdsourced machine-readable knowledge graph. OpenCyc represents one of the early works on commonsense knowledge; it includes an ontology and uses a proprietary representation language. As a result, the direct usage of both these commonsense knowledge bases as a backbone for applications related to intelligent systems remains difficult. As already mentioned, in this work we intend to validate triples from ConceptNet according to common sense.

6.2 Resources

In this section we introduce the resources used in this work. Framester is a hub linking FrameNet, WordNet, VerbNet, BabelNet, DBpedia, Yago, DOLCE-Zero and ConceptNet, as well as other resources. Framester does not simply create a strongly connected knowledge graph, but also applies a rigorous formal treatment of Fillmore's frame semantics, enabling full-fledged OWL querying and reasoning on the created joint frame-based knowledge graph. ConceptNet, which originated from the crowdsourcing project Open Mind Common Sense, is a freely available semantic network designed to help computers understand the meanings of the words that people use.

Since the focus of this work is to validate the facts in the LOD from the perspective of common sense, ConceptNet has been used as the primary dataset. Moreover, since both DBpedia and ConceptNet are contained in Framester, the link between them has also been leveraged.

In order to identify only the types of actions performed by human beings, we considered two image datasets as background knowledge, namely 'UCF101: a Dataset of 101 Human Actions Classes From Videos in The Wild' [2] and 'Stanford 40 Actions' [1]. UCF101 is currently the largest dataset of human actions: it consists of 101 action classes, over 13k clips and 27 hours of video data, with realistic user-uploaded videos containing camera motion and cluttered backgrounds. The Stanford 40 Action dataset contains images of humans performing 40 actions; each image provides a bounding box of the person performing the action, indicated by the filename of the image, and there are 9532 images in total with 180-300 images per action class. Only the action labels of these two datasets are extracted, in order to identify the possible types of actions that could be performed by human beings. Therefore, at a broad level, the data collection can be viewed as a two-step process:

  • Identify the types of actions that could be performed by human beings from UCF101 and Stanford 40 Actions dataset.

  • Find triples from the ConceptNet, which are related to these type of actions.

6.2.1 Proposed approach

As already mentioned, the goal of our work is to model the validation of LOD triples from the perspective of commonsense knowledge as a classification problem. We intend to classify triples into two classes: commonsense knowledge and not commonsense knowledge. The approach can be defined as a 4-step process:

  • Select triples from the knowledge base.

  • Annotate the triples using crowd sourcing approach.

  • Generate vectors for each triple using graph embedding.

  • Classify the vectors using supervised classifiers.

After collecting the triples from the knowledge bases, we annotated the collected data, as a proof of concept for the design, using a crowdsourcing approach. The features for the classification problem are generated using the RDF2Vec [81] graph embedding algorithm.

RDF2Vec [81]

is an approach for learning latent representations of the entities of a knowledge graph in a lower-dimensional feature space, with the property that semantically similar entities appear closer to each other in that space. Similar to word2vec word vectors, these vectors are generated by learning a distributed representation of the entities and their properties in the underlying knowledge graph; the vector length can be restricted to, e.g., 200 features. In this embedding approach, the RDF graph is first converted into sequences of entities, which can be considered as sentences. These sequences are generated by choosing a subgraph depth d per node, i.e., the number of hops from the starting node. Using these hops, the connection between ConceptNet and DBpedia can be leveraged, even though the two knowledge graphs do not share a direct common link. Note that, in this case, all prefixes from both knowledge graphs are kept intact in their namespaces, in order to distinguish properties with the same labels in the vector space.
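A minimal sketch of such sequence generation with rdflib, followed by word2vec training with gensim; the input file name, the walk parameters and the 200-dimensional vector size are illustrative assumptions, not the exact RDF2Vec implementation of [81].

```python
# Sketch of RDF2Vec-style walk generation over an RDF graph, followed by word2vec training.
import random
from rdflib import Graph
from gensim.models import Word2Vec

g = Graph()
g.parse("conceptnet_dbpedia_sample.ttl", format="turtle")   # hypothetical merged sample

def walks_from(entity, depth=4, num_walks=10):
    """Random walks of the form entity -> predicate -> entity ..., keeping full URIs."""
    walks = []
    for _ in range(num_walks):
        walk, node = [str(entity)], entity
        for _ in range(depth):
            out_edges = list(g.predicate_objects(subject=node))
            if not out_edges:
                break
            p, o = random.choice(out_edges)
            walk += [str(p), str(o)]
            node = o
        walks.append(walk)
    return walks

sentences = [w for s in set(g.subjects()) for w in walks_from(s)]
model = Word2Vec(sentences, vector_size=200, window=5, min_count=1, sg=1)
```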

Classification Process

The vectors generated by RDF2Vec can be directly used for the classification process. For the validation of LOD facts, we design a binary classifier that classifies triples as commonsense knowledge or not commonsense knowledge. The classifiers used for this purpose are Random Forest and SVM.

Random Forest.

Random forests, or random decision forests, are an ensemble learning method for classification, regression and other tasks that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees.

Support Vector Machine (SVM). An SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall on.
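A sketch of this classification step with scikit-learn, assuming each triple has already been turned into a feature vector (e.g., by concatenating the subject, predicate and object embeddings) and labeled through the crowdsourced annotation; the data below is random placeholder data.

```python
# Binary classification of triple vectors: commonsense (1) vs not commonsense (0).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 600))        # 200 triples x (3 x 200-dim embeddings)
y = rng.integers(0, 2, size=200)       # 1 = commonsense, 0 = not commonsense

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

for clf in (RandomForestClassifier(n_estimators=100, random_state=0), SVC(kernel="rbf")):
    clf.fit(X_train, y_train)
    print(type(clf).__name__, accuracy_score(y_test, clf.predict(X_test)))
```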

6.3 Experimental Setup

Data Collection for Proof of Concept.

A preliminary set of candidate commonsense triples has been selected from ConceptNet, focusing on triples related to human actions. Exploring these triples, and in particular their properties, we performed a manual alignment towards other knowledge graphs, such as DBpedia, to search for connected facts, and selected candidate triples for domain knowledge or general knowledge. As a proof of concept, the triples have been manually selected using the ConceptNet GUI, but the approach can be automated by querying the ConceptNet SPARQL endpoint, exploring connected triples in ConceptNet, and aligning them with other knowledge graphs.

Automated Data Collection.

The triples for the task can be extracted automatically in a couple of ways. Since the data comprises actions performed by human beings, taking verbs into consideration and finding the triples surrounding them could be useful. Also, the selection of proper frames from Framester could lead to appropriate triples for the task. Moreover, the word embedding vectors available for ConceptNet could be used to identify triples in the knowledge base, by taking into account vectors that are in close proximity in the vector space.

Annotation by crowdsourcing.

The dataset created as proof of concept has been annotated, to distinguish common sense from general knowledge, using crowdsourcing. A preliminary set of demonstrative multiple-choice questions has been prepared for this purpose, uploaded to Google Forms and proposed to some ISWS 2018 attendees. Figure 6.1 shows the interface and an example question used to distinguish commonsense from domain-knowledge facts related to surfing. An important aspect of common sense is the context of the situation, which needs to be taken into account to distinguish common sense from general knowledge. Indeed, commonsense reasoning is not meant to derive all possible knowledge, which is usually not formalised in a knowledge graph, but only the knowledge that is relevant for the situation: a fact could be common sense in a specific situation, but not in another one.

According to the agreed crowdsourcing methodological process, after an initial test run with volunteers, the questionnaire has been revised and adapted, taking into consideration the comments received from the participants. In particular, the task description and the preliminary considerations have been improved, including an example of what common sense is.

This work can be extended with additional questions and by extending the list of question choices, automatically exploring the knowledge graph associated with each selected activity.

Figure 6.1: Use of crowdsourcing for triple annotation

Figure 6.2 shows the data collected for one of the 5 questions we asked to 14 people. All the data are available online at https://docs.google.com/forms/d/1E9dpMcTBz27KjBq9ZoKQxrD8RWOi3t4E4tneCmgXLg0/viewanalytics

While the results for some triples confirm their association with common sense, for instance the one claiming that you need a surfboard in order to surf, some of them seem to carry some ambiguity. For instance, it is not clear why surfing should have as a prerequisite the fact that you have to go to San Francisco. In this case, we expected to obtain very homogeneous results, since everyone knows that San Francisco is a wonderful place for surfing, but not the only one in the world. Hence, ambiguity represents one of the most important results we need to discuss.

Subject Predicate Object Validity
swim usedFor exercise 1
run causes breathlessness 1
disease causes virus 1
shower UsedFor Clean your Tooth 0
eat causes death 0
climb usedFor go up 1
smoking hasPrerequisite cigarette 1
Table 6.1: Results from the crowdsourcing annotation.
Figure 6.2: Example of question in the survey
Discussion on the Crowdsourced annotated data.

As previously discussed, the results showed a certain degree of ambiguity. We identified three main possible reasons. First, there could be users with low reliability. There are several strategies to detect them, for instance by collecting a statistically meaningful set of results or by using some golden questions. We included some golden questions, and we will use and extend them in future investigations. Golden questions can be used on the CrowdFlower platform, which implements some automatic mechanisms for computing reliability and trust scores for workers.

Second, ambiguity can be strictly related to the language itself or simply due to some misunderstanding of the question, possibly because of the user's cultural background.

Third, ambiguity can simply characterize those concepts that lie in the middle between common sense and what we consider general knowledge. Actually, these results can contain important information [6] about the participating users, for instance whether their knowledge or common sense is culturally biased. In order to retrieve this kind of information, we will try to cluster the data based on the geographical region of each person taking the survey.

6.4 Discussion and Conclusion

In this work we have investigated potential approaches for common sense annotation of LOD facts, to distinguish common sense from general knowledge in the context of a discourse. We propose an approach addressing the following research question:

Is there any pattern in LOD that allows us to distinguish a generally valid statement (e.g. common sense fact) from a context dependent one? If yes, why?

We partially address also the following question: What is a proper model for representing LOD validity?

Specifically, this work is a contribution to the areas of commonsense reasoning and the Semantic Web. The automatic tagging of commonsense facts could help enlarge existing knowledge graphs with additional facts, which can be inferred on the basis of commonsense knowledge. The approach described herein is inspired by current trends in the literature and proposes the application of supervised classification to distinguish commonsense knowledge from domain knowledge. The proof of concept described in this work leverages existing sources of common sense, specifically ConceptNet and Framester, expanded to other knowledge graphs through alignment in order to broaden the domain of the discourse. Frames, in particular, look promising for identifying sets of facts potentially related to common sense. A crowdsourcing experiment has also been designed and run as a proof of concept, demonstrating that crowdsourcing may be used to produce annotated datasets useful for training a classifier.

The demonstrative proof of concept described in this paper may evolve into an automatic approach where SPARQL queries are used to construct the knowledge base used for training, and LOD properties are used to expand the knowledge base starting from initial seeds. In our experiments we initially considered a list of human actions and started analysing common sense related to these actions in order to define the approach. Analogously, other seeds may be identified considering other potential topics of common sense.

This study has identified potential future lines of investigation. In particular, the dependency of common sense on the context, which has been considered in this work as the context of the discourse for the crowdsourcing annotation step, could be expanded to also consider the effect of cultural bias, which affects the perception of what common sense is. Clearly, if common sense is knowledge acquired on the basis of experience, the learning environment is an important aspect to take into account. Along the same line, the time, age and sex of the people involved in the discourse may also bias the distinction between common sense and general knowledge. In some contexts, there may be no clear distinction between stereotypes and common sense at all.

Other potential directions of investigation could explore alternative machine learning techniques, including deep learning. Although the results obtained using unsupervised machine learning approaches are promising, the choice of the corpora used for learning clearly affects the quality. This shortcoming could be mitigated by the fast expansion of LOD; in addition, robust statistical approaches such as Bayesian deep learning could be investigated. The investigation of the relationships between stereotypes, jokes and common sense could be interesting from a social science perspective and could also help prepare cleaner datasets to be used for training.

7.1 Related Work

The problem of LD validity and, more in general, of KG validity has not been largely investigated in the literature. A somehow related aspect that has been investigated is instead the assessment of LD quality. In this section, we first briefly explore the main state of the art concerning LD quality, and then survey the literature concerning constraint representation and extraction.

Data Quality. LD quality is a widely explored field in Semantic Web research, in which validity is sometimes included. Clark and Parsia defined validity in terms of data correctness and integrity. Zaveri et al. [104] create a framework to explore Linked Data quality, where validity is seen as one dimension of LD quality. In this survey, Zaveri et al. classify data quality dimensions into 4 main categories: accessibility, intrinsic properties, contextual and representational dimensions. In a more application-oriented work [43], validation is proposed through the usage of the Shape Expression language. Both cited works lack a clear definition of validity, which we aim to provide in this work.

Constraint representation. There are many ways to represent constraints in RDF graphs. Tao et al. [95] propose Integrity Constraints (IC), a constraint representation using an OWL ontology, specifically by using OWL syntax and structure. Fischer et al. [38] introduce RDF Data Descriptions to represent constraints in an RDF graph. In their approach, an RDF dataset is called valid (or consistent) if every constraint can be entailed by the graph.

Constraint Extraction. Related works concerning learning rules from KGs can be found in the literature. One way of doing so is by exploiting rule mining methods [34, 31], where rules are automatically discovered from KGs and represented in SWRL; we instead propose to use SHACL for representing constraints that are learned from KGs and that are ultimately used for validating (possibly new) statements of a KG. Another solution for mining logical rules from a KG is the AMIE system [41] and its upgrade AMIE+ [40], where, by exploiting Inductive Logic Programming solutions, a method for reducing the incompleteness of a KG under the Open World Assumption is proposed; this differs from our goal, which is validating the statements of a KG.

7.2 Proposed Approach

The problem we want to solve is learning/finding constraints for an RDF KG by exploiting the evidence coming from the data therein, and then applying the discovered constraints to (potentially new) triples in order to validate them. In this section, we specifically draft a solution for learning the three types of constraints listed below (see the end of Sect. LABEL:sec:LD_Validity for details on them), and then briefly present the validation process once the constraints are available.

  • Cardinality constraints.

  • Class constraints.

  • Datatype constraints.

Cardinality constraints. Cardinality constraints can be detected through the use of existing statistical solutions. Specifically, given the set T of triples for a given property p, maximum (resp. minimum) cardinality constraints (under some considerations about the domain of interest) could be assessed by statistically inspecting the number of triples available in the considered KG. An example of a maximum cardinality constraint expressed in SHACL is reported in the following, where, by statistically inspecting the available data, the learned conclusion is that each person can have at most one birth date.

ex:MaxCountExampleShape
    a sh:NodeShape ;
    sh:targetNode ex:Bob ;
    sh:property [
        sh:path ex:birthDate ;
        sh:maxCount 1 ;
    ] .
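A possible statistical procedure behind such a shape is sketched below: count how many distinct objects each subject has for the property and take the most frequent cardinality as the candidate sh:maxCount. The triples are toy data and the most-frequent-value criterion is only one of several conceivable choices.

```python
# Count objects per subject for a property and derive a candidate maximum cardinality.
from collections import Counter, defaultdict

triples = [
    ("ex:Bob",   "ex:birthDate", "1980-01-01"),
    ("ex:Alice", "ex:birthDate", "1975-05-05"),
    ("ex:Carol", "ex:birthDate", "1990-09-09"),
]

objects_per_subject = defaultdict(set)
for s, p, o in triples:
    if p == "ex:birthDate":
        objects_per_subject[s].add(o)

counts = Counter(len(objs) for objs in objects_per_subject.values())
max_count, _ = counts.most_common(1)[0]      # most frequent cardinality, here 1
print(f"sh:maxCount {max_count}")            # could be serialized into a SHACL shape
```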

Class constraints. Class constraints require that individuals participating in a predicate be instances of certain class types. To find this kind of constraints, a straightforward way to go could be querying the KG in order to find the classes to which the individuals participating in the predicate belong, and then assuming that all retrieved classes are valid for the property. However, KGs may contain noisy data and, therefore, some classes should not be considered for the class constraint of the property. An alternative way of approaching the problem could be exploiting ML approaches, specifically concept learning approaches [37, 63, 24, 82], to derive the concept description actually describing the collection of individuals participating in the predicate. Concept learning approaches are indeed more noise-tolerant and, as such, would be more suitable for the described scenario. In the following, an example of a class constraint expressed in SHACL is reported.

ex:ClassExampleShape
    a sh:NodeShape ;
    sh:targetNode ex:Bob, ex:Alice, ex:Carol ;
    sh:property [
        sh:path ex:address ;
        sh:class ex:PostalAddress ;
    ] .

Datatype constraints. Datatype constraints require that individuals participating in a predicate be instances of certain literal types (numeric, string, etc.). Here we assume that for a considered property there can be only one datatype. We focus on two kinds of datatypes, numeric (integer) and string, but other datatypes can be investigated as well. The envisioned approach is the following. Given a property p, the set of all objects related through p is collected. Then, based on the datatype occurrences related to p, a majority voting criterion is applied to determine the most common datatype value. Alternative approaches could also be considered. Specifically:

  • the exploitation of methods for performing regression tasks if the datatype is Integer.

  • the exploitation of embedding methods for performing similarity-based comparisons between values, if the type is string.

String embeddings can be computed by using state-of-the-art algorithms such as Google's word2vec [71], GloVe [28] and so on. In the following, an example of a datatype constraint in SHACL is reported.

ex:DatatypeExampleShape
    a sh:NodeShape ;
    sh:targetNode ex:Alice, ex:Bob, ex:Carol ;
    sh:property [
        sh:path ex:age ;
        sh:datatype xsd:integer ;
    ] .
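The majority-voting criterion for datatypes can be sketched as follows; the value list and the simple integer/string test are illustrative assumptions.

```python
# Guess the datatype of each object value and keep the most common one for the property.
from collections import Counter

values = ["34", "27", "forty", "51"]        # objects of ex:age

def guess_datatype(v):
    try:
        int(v)
        return "xsd:integer"
    except ValueError:
        return "xsd:string"

votes = Counter(guess_datatype(v) for v in values)
datatype, _ = votes.most_common(1)[0]
print(f"sh:datatype {datatype}")            # xsd:integer wins 3 to 1
```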
Matching the SHACL constraints to RDF dataset.

The SHACL instance graph identifies the nodes in the data graph, selected through targets and filters, that will be compared against the defined constraints. The data graph nodes identified by the targets and filters are called "focus nodes". Specifically, focus nodes are all nodes in the graph that:

  • match any of the targets, and

  • pass all of the filter Shapes.

SHACL can be used for documenting data structures or the input and output of processes, and for driving user interface generation or navigation, because all these processes require testing some nodes in a graph against shapes. This process is called "validation" and its outcome is called a "validation result". The validation fails if any test against any focus node fails; otherwise the validation passes.

7.3 Evaluation

We want to validate a KG by assessing the validity of its triples with respect to a set of constraints. As illustrated in the previous section, our hypothesis is that (some) constraints may be learned from the data. The aim of this section is thus to set up an evaluation protocol for assessing the effectiveness of the constraints learned from the data. Formally, we hypothesize (H1): our approach is able to learn constraints, expressed in SHACL, that can be used for identifying valid triples. Given H1, we evaluate our approach on the following research questions:

  • RQ1 Can we cover a majority of triples in the KG with our constraints?

  • RQ2 Are the constraints contradicting?

  • RQ3 Are the triples identified as valid plausible to a human?

Research Question | Evaluation | Expected Result
RQ1 Can we cover a majority of triples in the KG with our constraints? | automatic: number of triples covered by the constraints | percentage of triples covered (the higher, the better)
RQ2 Are the constraints contradicting? | look at all extracted constraints, evaluate contradictions | no constraints should be contradicting
RQ3 Are the triples identified as valid plausible to a human? | expert experiment: ask experts to evaluate plausibility | all (or a high percentage) of validated triples should be plausible to humans
Table 7.1: Research questions for evaluation and how they are applied.

RQ1 looks into how many triples can be covered by the constraints, to get an idea of how comprehensive the extracted rules are. The metric used for this purpose is based on counting the number of triples that are validated by the constraints, as well as the number of triples that are not valid according to the constraints. This evaluation provides an insight into how comprehensive the learned constraints are and how much of the KG is covered.

The goal of RQ2 is either to ensure that no contradicting constraints are learned or, alternatively, to assess the impact of contradicting constraints with respect to all learned constraints. Furthermore, understanding the reasons for contradicting constraints would be very important in order to improve the design of the proposed solution so as to limit such an undesired effect.

Finally, with RQ3, we want to assess whether the triples identified as valid by the learned constraints are plausible to a human. For this purpose a survey with Semantic Web experts is envisioned. It could be conducted as follows. First, all valid triples derived from the discovered constraints are collected, and a random sample of these valid triples is obtained; the cardinality of the sample should depend on the number of triples validated by each type of constraint. The sample is then provided to a group of experts, together with the instruction to mark which triples they consider valid, invalid, and/or plausible, i.e. where the content might be wrong but could be possible in the real world. For example, Barack Obama married_to Angelina Jolie is not correct, but somehow possible.

7.4 Discussions and Conclusions

We introduced an approach to discover/learn constraints from a Knowledge Graph. Our approach relies on a mixture of statistical and machine learning methods and on SHACL as representation language. We focused on three constraints, namely cardinality constraints, class constraints, and datatype constraints. We also presented an evaluation protocol for our proposed solution. While our approach is limited to the discussed constraints, it can be seen as a good starting point for further investigations of the topic.

7.5 Appendix

This section shows a proof of concept of the proposed solution. The data collection adopted for this purpose is ArCo, containing a plethora of resources belonging to the Italian cultural heritage. The dataset has been examined by means of SPARQL queries, some of which are reported in the following.

The dataset contains around 154 classes. We focused on the ArCo:CulturalEntity concept, acting as the root of our exploration, for a total of about 20 classes inspected. For the rest of the main reachable concepts, we found unknown names with only numeric identifiers, as shown in Figure 7.1, where the results have been collected by using Query 1.

As for ArCo:CulturalEntity, several subclasses can be discovered. Queries 2, 3, 4 and 5 show how to obtain this information. Figure 7.2 depicts a diagram of the concept hierarchy and the relationships among classes, for instance for ArCo:NumismaticProperty.

In order to exemplify each type of constraint explained in Sect. 7.2, we focus on a resource belonging to the class NumismaticProperty. The list of properties involved with NumismaticProperty is found using Query 6. Hence the resource <https://w3id.org/arco/resource/NumismaticProperty/0600152253>, whose name is "moneta RIC 219", is selected. Query 7 is used for retrieving the information related to this resource, which can be used as a case of possible constraints learned from the data.

Cardinality constraints: the property hasAgentRole can have 2 different values in its range for the same resource in its domain.

Class constraints: the property hasConservationStatus points to a resource belonging to the ConservationStatus class. Thus, the class expected in the range of this property should be ConservationStatus.

Datatype constraints:

http://www.w3.org/2000/01/rdf-schema#comment moneta, RIC 219, AE2, Romana imperiale

In this case, the property rdfs:comment should have a string as its range.

Figure 7.1: Main roots found in ontology.
Figure 7.2: Distribution of subclasses of CulturalEntity. NumismaticProperty is subclassOf MovableCulturalProperty, which is subClassOf TangibleCulturalProperty, which is subClassOf CulturalProperty, and this, is subClassOf CulturalEntity.

Query 1: Classes that act as roots

Select distinct ?nivel0
Where {
?nivel1 rdfs:subClassOf ?nivel0 .
?nivel2 rdfs:subClassOf ?nivel1 .
?nivel3 rdfs:subClassOf ?nivel2 .
?nivel4 rdfs:subClassOf ?nivel3 .
}

Query 2: First level subclasses from ArCo:CulturalEntity

Query 4: Third level subclasses from ArCo:CulturalEntity

Select distinct (<http://dati.beniculturali.it/cis/CulturalEntity> as ?level0) ?level1 ?level2 ?level3
where {
?level1 rdfs:subClassOf <http://dati.beniculturali.it/cis/CulturalEntity> .
?level2 rdfs:subClassOf ?level1 .
?level3 rdfs:subClassOf ?level2 .
}

level3:
https://w3id.org/arco/core/ArchaeologicalProperty
https://w3id.org/arco/core/ImmovableCulturalProperty
https://w3id.org/arco/core/MovableCulturalProperty

Query 5: Fourth level subclasses from ArCo:CulturalEntity

Select distinct (<http://dati.beniculturali.it/cis/CulturalEntity> as ?level0) ?level1 ?level2 ?level3 ?level4
Where {
?level1 rdfs:subClassOf <http://dati.beniculturali.it/cis/CulturalEntity> .
?level2 rdfs:subClassOf ?level1 .
?level3 rdfs:subClassOf ?level2 .
?level4 rdfs:subClassOf ?level3 .
}

level4:
https://w3id.org/arco/core/ArchitecturalOrLandscapeHeritage
https://w3id.org/arco/core/HistoricOrArtisticProperty
https://w3id.org/arco/core/MusicHeritage
https://w3id.org/arco/core/NaturalHeritage
https://w3id.org/arco/core/NumismaticProperty
https://w3id.org/arco/core/PhotographicHeritage
https://w3id.org/arco/core/ScientificOrTechnologicalHeritage

Query 6: Properties about resources belonging to NumismaticProperty

Select distinct ?p
Where {
?s ?p1 <https://w3id.org/arco/core/NumismaticProperty> .
?s ?p ?o .
}
p                   http://www.w3.org/1999/02/22-rdf-syntax-ns#type
p.1                 http://www.w3.org/2000/01/rdf-schema#label
p.2                 http://www.w3.org/2000/01/rdf-schema#comment
p.3                 https://w3id.org/arco/catalogue/isDescribedBy
p.4                 https://w3id.org/arco/core/hasAgentRole
p.5                 https://w3id.org/arco/core/hasCataloguingAgency
p.6                 https://w3id.org/arco/core/hasHeritageProtectionAgency
p.7                 https://w3id.org/arco/core/iccdNumber
p.8                 https://w3id.org/arco/core/regionIdentifier
p.9                 https://w3id.org/arco/core/uniqueIdentifier
p.10                https://w3id.org/arco/location/hasTimeIndexedQualifiedLocation
p.11                https://w3id.org/arco/objective/hasConservationStatus
p.12                https://w3id.org/arco/subjective/hasAuthorshipAttribution
p.13                https://w3id.org/arco/subjective/hasDating
p.14                https://w3id.org/arco/location/hasCulturalPropertyAddress
p.15                https://w3id.org/arco/objective/hasCulturalPropertyType
p.16                https://w3id.org/arco/objective/hasCommission
p.17                https://w3id.org/arco/core/suffix
p.18                https://w3id.org/arco/subjective/iconclassCode

Query 7: NumismaticProperty resource data example <https://w3id.org/arco/resource/NumismaticProperty/0600152253>, "moneta RIC 219".

Select distinct ?p ?o
Where {
<https://w3id.org/arco/resource/NumismaticProperty/0600152253> ?p ?o .
}
Figure 7.3: Data related to the resource <https://w3id.org/arco/resource/NumismaticProperty/0600152253>

Query 8: Data related to <https://w3id.org/arco/resource/ConservationStatus/0600152253-stato-conservazione-1>, linked to the selected NumismaticProperty resource via the property hasConservationStatus

Select distinct (<https://w3id.org/arco/resource/ConservationStatus/0600152253-stato-conservazione-1> as ?s) ?p ?o
Where {
<https://w3id.org/arco/resource/ConservationStatus/0600152253-stato-conservazione-1> ?p ?o .
}

8.1 Related Work

Until now, progress on ontology consistency and on using shapes to validate RDF data has been largely separate.

  • How to Repair Inconsistency in OWL 2 DL Ontology Versions? [14]. In this paper, the authors have developed an a priori approach to checking ontology consistency. The definition of consistency used encompasses syntactical correctness, the absence of semantic contradictions, and generic style constraints for OWL 2 DL.

  • ORE: A Tool for the enrichment, repair and validation of OWL based knowledge bases [14]. ORE uses OWL reasoning to detect inconsistencies in OWL-based knowledge bases. It also uses the DL-Learner framework, which can detect potential problems if instance data is available. However, relying only on reasoners is not suitable for treating large knowledge graphs like LOD, due to scalability issues and to the fact that reasoners cannot detect all inconsistencies in data.

  • Using Description Logics for RDF Constraint Checking and Closed-World Recognition [76]. There the authors discuss various approaches to validate data by enforcing certain constraints, including SPIN rules in TopQuadrant products, ICV (Integrity Constraint Violation) in Stardog, and OSLC, ShEx, and SHACL shapes. The authors note that shapes are most “similar to determining whether an individual belongs to a Description Logic description”.

  • TopBraid Composer [8] allows converting some OWL restrictions into a set of SHACL constraints. The downside of this approach is that the constraints are produced under the assumption that the ontology designers intended a closed-world setting in their ontologies. For example, an 'rdfs:range' axiom from the OWL ontology will be naively translated into a 'class' or a 'datatype' constraint. Such a translation will effectively prevent the case where the ontology engineer envisioned that a property 'dad' pointing to an instance of 'Male' would allow inferring that this instance is also a 'Father'. Further, class disjointness and more complex axioms, where the values of 'owl:allValuesFrom', 'owl:someValuesFrom', 'owl:hasValue' or 'owl:onClass' are intersections of other restrictions, are not handled.

8.2 Resources

In this work we are using two datasets to exemplify our approach: Bio2RDF and Wikidata.

Bio2RDF.

An open-source project that uses Semantic Web technologies to support the process of biomedical knowledge integration [16]. It transforms a diverse set of heterogeneously formatted data from public biomedical and pharmaceutical databases such as KEGG, PDB, MGI, HGNC, NCBI, Drugbank, PubMed, dbSNP, and clinicaltrials.gov into a globally distributed network of Linked Data, identified by unique URIs of the form http://bio2rdf.org/namespace:id. Bio2RDF, with 11 billion triples across 35 datasets, provides the largest network of Linked Data for the Life Sciences and applies the Semanticscience Integrated Ontology, which makes it a popular resource for tackling the problem of knowledge integration in bioinformatics.

Wikidata.

A central storage repository maintained by the Wikimedia Foundation. It aims to support other wikis by providing dynamic content that does not need to be maintained in each individual wiki project. For example, statistics, dates, locations and other common data can be centralized in Wikidata. Wikidata is one of the biggest and most cited sources on the Web, with 49,243,630 data items that anyone can edit.

8.3 Proposed Approach

The proposed approach aims to check the "consistency" of Linked Open Data (LOD). An inconsistency arises when contradictory statements can be derived (from the data and the ontology) by a reasoner. Using OWL reasoners to detect such inconsistencies has a very high complexity and therefore might not scale to big datasets. Consequently, we propose the use of a validation mechanism that ensures that the data in an RDF base satisfy a given set of constraints and prevents the reasoners from inferring unwanted relationships. Moreover, this enables the portability and reusability of the generated rules over other data sources.

The definition of such constraints can come from several sources: (1) knowledge engineer expertise, (2) ontologies, and (3) data. In this work we only consider the definition of constraints through ontologies. Moreover, instead of expressing these SHACL constraints manually, we aim to generate them automatically from a set of ontology axioms. With this approach we allow a fast (although not necessarily complete) check of data consistency.

This approach shall include a proof (or an indication of the possibility of such a proof) that the shape constraints derived from the ontology are sound. By soundness, we mean that the set of derived constraints will not prevent data, which would otherwise be valid and could be reasoned over without inconsistencies, from being inserted into the triplestore. Completeness (i.e. that every inconsistency-creating data item would be detected by a SHACL shape and thus rejected) may not be achieved, because the standard version of SHACL does not include a full OWL reasoner. Therefore, we cannot guarantee that every inconsistency arising after several steps of reasoning will be detected by the SHACL shapes. An example of an undetected inconsistency is provided after the definition of the rules that we use to derive SHACL constraints from the ontology.

Below we present the rules for translating some description logic axioms into SHACL constraints; a sketch of how this derivation could be automated is given after the rules.

  • Rule 1: cardinality restriction. For every property R for which it is stated that R has cardinality at most n, we add the SHACL shape

        ex:CardinalityRestriction a sh:PropertyShape ;
            sh:path R ;
            sh:maxCount n .

    If the same property also has an ‘rdfs:domain’ axiom with a class DC, a more specific shape may be added:

    ex:SpecificCardialityRestriction a sh:NodeShape ;
        sh:targetClass DC ;
        sh:property [
          sh:path R ;
          sh:maxCount 1;
        ] .
    
    
  • Rule 2: datatype restriction (range). For every property R whose range is the class C, and for every class D which is explicitly disjoint from C, we add the SHACL shape

        DatatypeRestriction a sh:PropertyShape ;
            sh:path R ;
            sh:not [
                sh:datatype D ;
            ] .
  • Rule 3: class disjointness. For any classes C and D that are explicitly stated to be disjoint, we add the SHACL shape

        DisjointnessShape a sh:NodeShape ;
            sh:targetClass C ;
            sh:property [
                sh:path rdf:type ;
                sh:not [
                    sh:hasValue D ;
                ]
            ] .
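As mentioned above, a sketch of how such a derivation could be automated with rdflib is given below. Rule 1 is instantiated here for the owl:FunctionalProperty case (maxCount 1), as in the use case of Section 8.4, together with Rule 3; the input ontology file and the shape naming scheme are placeholders.

```python
# Scan the TBox for functional properties and class disjointness, emit SHACL shapes.
from rdflib import Graph, Namespace, BNode, Literal
from rdflib.namespace import RDF, OWL

SH = Namespace("http://www.w3.org/ns/shacl#")
EX = Namespace("http://isws.example.com/shapes/")

tbox = Graph()
tbox.parse("ontology.ttl", format="turtle")       # hypothetical TBox file

shapes = Graph()
shapes.bind("sh", SH)

# Rule 1 (functional-property case): owl:FunctionalProperty -> sh:maxCount 1
for prop in tbox.subjects(RDF.type, OWL.FunctionalProperty):
    shape = EX[f"maxCount_{prop.split('/')[-1]}"]
    shapes.add((shape, RDF.type, SH.PropertyShape))
    shapes.add((shape, SH.path, prop))
    shapes.add((shape, SH.maxCount, Literal(1)))

# Rule 3: owl:disjointWith -> sh:not [ sh:hasValue D ] on rdf:type
for c, d in tbox.subject_objects(OWL.disjointWith):
    shape, pshape, nshape = EX[f"disjoint_{c.split('/')[-1]}"], BNode(), BNode()
    shapes.add((shape, RDF.type, SH.NodeShape))
    shapes.add((shape, SH.targetClass, c))
    shapes.add((shape, SH.property, pshape))
    shapes.add((pshape, SH.path, RDF.type))
    shapes.add((pshape, SH["not"], nshape))
    shapes.add((nshape, SH.hasValue, d))

print(shapes.serialize(format="turtle"))
```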

Note that SHACL makes subclass inferences, so any instance that explicitly belongs to two classes C’ and D’ that are respective subclasses of C and D, would be rejected during SHACL validation. However, this SHACL shape does not cover all cases of inconsistencies arising from disjointness of classes. An example of such a case (involving self-cannibalism) is presented below.

As stated before, we cannot ensure that every inconsistency will be detected by the SHACL shapes generated by our rules. Figure 8.2 depicts an example of inconsistency that arises after an inference that would be done by a full OWL reasoner but not by SHACL.

Figure 8.2: Example of inconsistency detected by OWL reasoner

Since Bob eats himself, a reasoner would infer that Bob is both a Human and a Tiger, while these two classes are supposed to be disjoint. Our SHACL shapes would not make the inferences needed to detect that Bob is a Human and a Tiger, and therefore would not detect the inconsistency. In order to detect such inconsistencies with SHACL shapes, it would be necessary either to perform some inference beforehand, or to derive a stronger set of SHACL rules from the ontology.

Now consider another example, assuming that the following triples already exist in Bio2RDF: @prefix bio2rdf: <http://bio2rdf.org> .

The properties in black are those explicitly stated in Bio2RDF: the "Protein" with Ensembl id "ENSPXXXXX" is related to the "Gene" with Ensembl id "ENSGXXXXX" and is translated from the "Transcript" with Ensembl id "ENSTXXXXX". From these two triples, a domain expert can implicitly infer the property shown in red. Now assume that the following triple is received and added to the current data:

The information expressed by this triple contradicts what is already derived from the previously presented data. Therefore, to prevent "inconsistency", all inferred data should also be considered in the constraints. In this particular example, a SHACL shape may be used to restrict the cardinality of the 'is_transcribed_from' property to point to at most one Gene, and a conjunctive constraint may be used to ensure that the 'Transcript' and the 'Protein' translated from it point to the very same Gene.

8.4 Evaluation and Results

In the following use-case, the Wikidata dataset is used and the following simplifications are made for the sake of readability and evaluation in a browser-based SHACL validator:

  • The Wikidata entity ‘wde:Q515’ for City is referred to as ‘isw:City’.

  • The Wikidata entity ‘wde:Q30185’ for Mayor is referred to as ‘isw:Mayor’.

  • The Wikidata entity ‘wde:Q146’ for Cat is referred to as ‘isw:Cat’.

  • The Mayor is declared to be a subclass of foaf:Person, and the Cat is declared to be disjoint with foaf:Person, in order to exemplify the logical inconsistency that arises when the information about Stubbs is added.

  • The ‘hasMayor’ property is defined to directly link City and Mayor class instances, and a maximum cardinality restriction of 1 is defined.

@prefix rdf:   <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs:  <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix foaf:  <http://xmlns.com/foaf/0.1/> .
@prefix isw: <http://isws.example.com/> .
@prefix wde: <http://www.wikidata.org/entity/> .

isw:Mayor rdfs:subClassOf foaf:Person .
isw:Cat owl:disjointWith foaf:Person .
isw:hasMayor rdf:type owl:ObjectProperty ,
                      owl:FunctionalProperty ;
             rdfs:range isw:Mayor .

Rule 3 allows us to produce the shape with the following constraint:

    isw:MayorShape a sh:NodeShape ;
    sh:targetClass isw:Mayor ;
    sh:property [
            sh:path rdf:type ;
            sh:not [
                sh:hasValue isw:Cat
            ]
    ] .

Similarly, Rule 1 allows us to derive a shape with the cardinality constraint:

isw:hasMayorShape a sh:PropertyShape ;
    sh:path isw:hasMayor ;
    sh:maxCount 1 .

Then, we validate the data prior to its insertion into the triplestore containing the knowledge base:

isw:jdoe   a           isw:Mayor, foaf:Person;
           foaf:name   "John Doe" .

isw:stubbs a           isw:Mayor, isw:Cat;
           foaf:name   "Stubbs".

isw:city1 a            isw:City;
          isw:hasMayor isw:jdoe, isw:stubbs .

The validation against the set of the derived shapes produces the following validation report:

[
    a sh:ValidationResult;
    sh:focusNode isw:city1;
    sh:resultMessage "More than 1 values";
    sh:resultPath isw:hasMayor;
    sh:resultSeverity sh:Violation;
    sh:sourceConstraintComponent sh:MaxCountConstraintComponent;
    sh:sourceShape []
].
[
    a sh:ValidationResult;
    sh:focusNode isw:stubbs;
    sh:resultMessage "Value does have shape Blank node \_:n3625";
    sh:resultPath rdf:type;
    sh:resultSeverity sh:Violation;
    sh:sourceConstraintComponent sh:NotConstraintComponent;
    sh:sourceShape [];
    sh:value isw:Cat
].

The resulting report demonstrates how a set of shapes produced at design time, solely from the TBox part of the ontology, allows preventing the insertion of resources that would cause a logical inconsistency for the reasoning process in the triplestore. As shown in the report, there are two shape violations: the first is triggered by the unmet cardinality restriction, and the second derives from the disjointness axiom.
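Such a report can also be reproduced programmatically; a minimal sketch assuming the pySHACL library and the data and shape snippets above saved to files (file names are placeholders):

```python
# Run SHACL validation of the example data against the derived shapes with pySHACL.
from rdflib import Graph
from pyshacl import validate

data_graph = Graph().parse("city_data.ttl", format="turtle")        # jdoe, stubbs, city1
shapes_graph = Graph().parse("derived_shapes.ttl", format="turtle")

conforms, results_graph, results_text = validate(data_graph, shacl_graph=shapes_graph)
print(conforms)        # False: the two violations discussed above are reported
print(results_text)    # human-readable validation report
```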

8.5 Conclusion and Discussion

Areas of Linked Open Data application are expanding beyond data dumps in which the TBox is immediately accompanied by the corresponding ABox. Linked Open Data technologies are used to power applications that allow concurrent modification of the ABox data in real time and require scalability to handle Big Data. In this paper, we presented an approach to ensure the logical consistency of the ontology at runtime by checking the changes to the ABox against a set of statically generated SHACL shapes. These shapes were derived from the ontology TBox using a set of formal rules.

An ideal SHACL validation system would be sound and complete, as described in Section 8.3. However, since OWL and SHACL rely on the open-world and closed-world assumptions respectively, soundness and completeness are difficult to achieve simultaneously (in other words, SHACL shapes derived from the ontology tend to be either too strong or too weak). We chose to ensure soundness and left completeness for future work. A continuation of the work started in this document should contain proofs that soundness is indeed achieved, and further rules for the automated creation of SHACL shapes should be added in order to get closer to completeness.

Future work will also be directed to extending the approach to support reasoning without the unique name assumption. This extension will require changes in some of the proposed shapes.

9.1 Resources

In our experiments we use the semantic annotation system DBpedia Spotlight. It allows for semantic queries in order to perform a range of NLP tasks. The tool can be accessed through a web application, as well as via a web Application Programming Interface (API).

  • The DBpedia knowledge base [65] is the result of both collaborative and automated work that aims to extract structured information from Wikipedia in order to make it freely available on the Web, link it to other knowledge bases and allow it to be queried by computers [13].

  • DBpedia Spotlight [30] is a Named Entity Linking service based on DBpedia that "looks for about 3.5M things of unknown or about 320 known types in text and tries to link them to their global unique identifiers in DBpedia." The system uses context elements extracted from Wikipedia and keyword similarity measures to perform disambiguation. It can be downloaded and installed locally or queried through open APIs in ten languages (a small usage sketch is given after this list). There is a variety of Linked Data that can be utilised to further evaluate the viability of our framework.

  • We use the Open Blockchain implemented by the Knowledge Media Institute to interact with a blockchain through four sets of APIs: User API, Store API, Util API and IPFS API. The first set of commands provides user authentication and account management. The Store API and Util API allow full interaction with the blockchain, including requests for the smart contracts stored in a blockchain and their hashes, and the registration of a new instance of the RDF store contract. The IPFS API provides access to an IPFS storage.
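As an illustration of the Spotlight usage mentioned above, the following is a minimal sketch of annotating a text with the public DBpedia Spotlight API; the endpoint URL, parameters and JSON fields follow the publicly documented service and should be treated as assumptions here.

```python
# Annotate a text with DBpedia Spotlight and print the linked DBpedia entities.
import requests

ENDPOINT = "https://api.dbpedia-spotlight.org/en/annotate"
text = "Berlin is the capital of Germany."

response = requests.get(
    ENDPOINT,
    params={"text": text, "confidence": 0.5},
    headers={"Accept": "application/json"},
)
response.raise_for_status()

for resource in response.json().get("Resources", []):
    # Each annotation links a surface form in the text to a DBpedia URI.
    print(resource["@surfaceForm"], "->", resource["@URI"])
```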

9.2 Proposed Approach

Smart-Contract Response Principle

A smart contract is an immutable, self-executable piece of code containing agreements that must be respected. For smart contracts, a set of rules is formulated as executable code and compliance with the rules is verified on the nodes. Users have access to smart contracts. Since we deal with two different types of users (trusted and untrusted), we propose to use two different validation models in our framework. The obtained responses (decisions) can be processed in two different ways (with respect to the type of user) in order to reach a consensus-based response. We have constructed a general model that can be adapted to the two types of users.

The basis for the final decision is the majority vote. Let us consider how the majority vote model can be applied to blockchain-based validation. For a query we obtain a (potentially unbounded) sequence of responses $r_1, r_2, \dots$ (one response per claim). Each claim can either be “accepted” or “rejected”, i.e. $r_i \in \{0, 1\}$, where 0 and 1 correspond to the “reject” and “accept” responses, respectively. To take the final decision, we define the function $f$:

$$f(r_1, \dots, r_n) = \left\lfloor \frac{1}{2} + \frac{1}{n} \sum_{i=1}^{n} r_i \right\rfloor \qquad (9.1)$$

where $n$ is the number of responses required for taking the final decision and $\lfloor \cdot \rfloor$ is the floor function, i.e., it takes a real number as input and returns the greatest integer less than or equal to it. The function takes the first $n$ responses and returns 1 if the final decision is “accept” and 0 if it is “reject”.
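A minimal sketch of the fixed-$n$ majority vote follows, assuming the rounding form of Equation 9.1 reconstructed above; the function name and types are illustrative only.

```python
import math
from typing import Iterator

def majority_vote(responses: Iterator[int], n: int) -> int:
    """Fixed-n majority vote (Eq. 9.1): consume the first n responses
    (1 = accept, 0 = reject) and return 1 for "accept", 0 for "reject"."""
    first_n = [next(responses) for _ in range(n)]
    return math.floor(0.5 + sum(first_n) / n)

# The sequence from Example 4, Case 1: 0101110 with n = 7 -> prints 1 ("accept")
print(majority_vote(iter([0, 1, 0, 1, 1, 1, 0]), n=7))
```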

Weak Validation Model

When trusted users access smart contracts, we use a weak validation model (for example, consider a setting where, to get a Schengen visa, it is sufficient to obtain an approval or rejection from only one country). In this model, $n$ is a fixed value, since all the responses are obtained from reliable sources. In the simplest case, where $n = 1$, the function returns the final decision based on the first received answer. Since the responses are trusted, their number is supposed to be small.

Strong Validation Model

When untrusted users access smart contracts, we require stricter approval rules. In other words, we require strong validation for untrusted responses (for example, when a citizen accesses smart contracts, the obtained responses should be verified carefully). We assume that most of the users are trusted (or at least more than half). In that case, the first model can be used with a large $n$, i.e., many responses need to be received before a final decision is taken. The weakness of applying that model to untrusted users is the following: since the number of required responses has to be large, reaching the final decision can take a long time. We therefore propose a difference-based model, where the final decision is taken as soon as the difference between the numbers of “accept” and “reject” answers exceeds a chosen value, i.e., $n$ is not fixed in advance; the number of responses that needs to be received depends on the difference between the numbers of obtained “accept” and “reject” responses:

$$n = \min\big\{\, k : Q_k \geq \Delta \,\big\}, \qquad Q_k = \Big| \sum_{i=1}^{k} \mathbb{1}[r_i = 1] - \sum_{i=1}^{k} \mathbb{1}[r_i = 0] \Big|$$

where $\mathbb{1}[\cdot]$ is an indicator function that takes the value 0 or 1 when the condition in the brackets is false or true, respectively, $Q_k$ is the difference between the numbers of “accept” and “reject” responses after $k$ responses, and $n$ is the minimal number of responses for which this difference reaches the chosen threshold $\Delta$.

Example 4.

Let us consider how the majority vote models work in practice.

  1. Case 1: $n$ is fixed. Let $n = 7$, i.e., seven responses need to be obtained to take the final decision. Assume the sequence of responses is 0101110.

    Since four of the seven responses are “accept”, the final decision is “accept”.

  2. Case 2: $n$ is not fixed and depends on the difference between the obtained responses. Let $\Delta = 2$. The responses received and the resulting decisions are summarized in Table 9.1.

Sequence no. of responses | Response value | Comment on the final decision
1                         | 0              | Q = 1, the decision cannot be taken
2                         | 1              | Q = 0, the decision cannot be taken
3                         | 0              | Q = 1, the decision cannot be taken
4                         | 0              | Q = 2, the decision can be taken, n = 4
Table 9.1: The principle of decision making for a non-fixed number of responses. Q is the difference between the numbers of “accept” and “reject” responses.
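The following sketch reproduces the difference-based rule and the decision sequence of Table 9.1; the function name and the threshold symbol follow the notation assumed above, and the optional response cap is a simplified stand-in for the response-time limit T discussed below.

```python
from typing import Iterable, Optional

def difference_based_vote(responses: Iterable[int], delta: int,
                          max_responses: Optional[int] = None) -> Optional[int]:
    """Strong validation model: return 1 ("accept") or 0 ("reject") as soon as the
    absolute difference between accept and reject counts reaches the threshold delta.
    Returns None if max_responses is reached before a decision is possible."""
    accepts = rejects = 0
    for k, r in enumerate(responses, start=1):
        accepts += r
        rejects += 1 - r
        if abs(accepts - rejects) >= delta:        # Q_k >= delta
            return 1 if accepts > rejects else 0   # decision taken after n = k responses
        if max_responses is not None and k >= max_responses:
            return None
    return None

# The sequence of Table 9.1: 0, 1, 0, 0 with delta = 2 -> decision "reject" (0) after n = 4
print(difference_based_vote([0, 1, 0, 0], delta=2))
```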

The proposed models have a “limited-response” drawback: in cases where only a few responses can be obtained, the time needed to reach the final decision might be long. To avoid this time-loss problem, Q may be left non-fixed and a limit on the maximal response time T is fixed instead. In that case, the requirement on the final decision can be relaxed so that the final result is obtained within the chosen time frame.

Proof of Concept

In our proof of concept we developed an application to test the Open Blockchain [4] (2018a) infrastructure and API together with a Linked Open Data dataset. A brief demo can be found here: https://hufflepuff-iswc.github.io. The application implements the complete workflow needed to store data on the blockchain and link it with Linked Open Data functionalities. A screenshot of the application is provided in Figure 9.1 and each step is described below:

  • In order to store information in the blockchain, the user first creates an account and registers with their credentials. In contrast to standard authentication methods, the user’s credentials are stored in an encrypted, decentralized way in the blockchain. After a successful login the user receives an authentication token, which is then used to authorize subsequent requests.

  • In the next step, the user creates a new instance of the RDF store to put their data in, using the authentication token. After the store is created it is put into a block and the transaction number is returned. Each transaction and block creation is visualised at the top of the page.

  • Mining a block takes some time. The user can check the status of the block mining by requesting the block receipt.

  • Using the authentication token and the transaction number, the RDF store is registered in the blockchain. Registering the store creates the smart contract, and the address of this contract is returned to the user.

  • At any time, the user can list the RDF stores associated with their account.

  • In the next step, the user uploads the file or data to be stored in the blockchain. The file or data is then automatically split into validatable statements, and semantic information in the form of RDF triples is extracted from them.

  • Finally, the extracted RDF data is stored in a transaction in the blockchain (a minimal client-side sketch of this workflow follows the list).
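The sketch below mirrors the client-side workflow of the list above as plain HTTP calls. The base URL, endpoint paths, payload fields and response fields are hypothetical placeholders: the actual Open Blockchain User/Store/Util API routes are not specified in this report, so the code only reflects the sequence of steps.

```python
# Hypothetical client-side sketch of the proof-of-concept workflow.
# All endpoint paths and field names are placeholders, not the real Open Blockchain API.
import requests

BASE_URL = "https://blockchain.example.org/api"  # placeholder base URL

session = requests.Session()

# 1. Register and log in to obtain an authentication token (User API).
session.post(f"{BASE_URL}/user/register", json={"user": "alice", "password": "secret"})
token = session.post(f"{BASE_URL}/user/login",
                     json={"user": "alice", "password": "secret"}).json()["token"]
headers = {"Authorization": f"Bearer {token}"}

# 2. Create a new RDF store instance; a transaction number is returned (Store API).
tx = session.post(f"{BASE_URL}/store/create", headers=headers).json()["transaction"]

# 3. Wait for the block to be mined by polling the block receipt (Util API).
receipt = session.get(f"{BASE_URL}/util/receipt/{tx}", headers=headers).json()

# 4. Register the RDF store; a smart contract is created and its address returned.
contract = session.post(f"{BASE_URL}/store/register", headers=headers,
                        json={"transaction": tx}).json()["contract_address"]

# 5. Upload the document; it is split into validatable statements and RDF triples,
#    which are stored via IPFS and referenced in a blockchain transaction.
with open("document.pdf", "rb") as document:
    session.post(f"{BASE_URL}/store/{contract}/upload", headers=headers,
                 files={"file": document})
```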

The user can choose which statements they want to validate and select from the trusted authorities suggested by the system. Using the semantic information, enriched with contact information from open sources, the system informs the authorities about the validation request. The validation request, together with its status, is stored in the user’s profile and can be shared with third parties independently of other information.

In order to validate the stored data, the trusted authority signs the verifiable information using its private key. The data and the signature are stored together in the blockchain application.

To verify a validation, an organisation sends a request with the document to the system. The system retrieves the stored sentences and extracted RDF triples together with the signatures of the trusted authorities and, if the information can be validated successfully, attaches a validation badge to each statement.
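As an illustration of the signing and verification steps, the following sketch uses Ed25519 signatures from the cryptography library; the key handling, the serialised statement and the choice of signature scheme are assumptions, since the report does not fix a particular scheme.

```python
# Minimal sketch of the authority-side signing and the verification of a statement.
# The Ed25519 scheme and the serialisation of the statement are illustrative assumptions.
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
from cryptography.exceptions import InvalidSignature

statement = b"<http://example.org/alice> <http://example.org/hasDegree> <http://example.org/MScCS> ."

# Trusted authority: sign the validatable statement with its private key.
authority_key = Ed25519PrivateKey.generate()
signature = authority_key.sign(statement)

# Verifier: check the statement against the authority's public key before
# attaching a validation badge to it.
public_key = authority_key.public_key()
try:
    public_key.verify(signature, statement)
    print("Statement validated: badge can be attached.")
except InvalidSignature:
    print("Signature check failed: statement is not validated.")
```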

Figure 9.1: Screenshot of the proof of concept application

The prototype consists of multiple components, shown in the architecture overview in Figure 9.2. The user makes an HTTP request to the API and uploads the document that should be validated by the system (1). The Named Entity Recognition component extracts the semantic entities using natural language processing techniques (2). The entities are represented as RDF triples, combined with information from the Linked Open Data cloud (4), and put into the Open Blockchain network (5.1). In the network, the document and the RDF triples are stored in an InterPlanetary File System (IPFS, http://ipfs.io) distributed file storage network and the retrieved hash is stored in a blockchain transaction (5.2). This information is then stored for validation.

Figure 9.2: Architecture Overview. Adapted from Domingue, J. (2018) Blockchains and Decentralised Semantic Web Pill, ISWS 2018 Summer School, Bertinoro, Italy.
Use Cases

The proposed distributed validation approach can be used in multiple different use cases.

Blockchain Dating

The first suggested use case is storing dating data on a blockchain. In this case, semantic triples are extracted from all personal dating-relevant data (e.g., interests, age, ex-partners), encrypted, and put into IPFS. The retrieved hashes are stored on the blockchain. Permissions are defined to describe how and which parts of this personal data can be used by different services or dating websites. The description of data usage and permissions is written in a separate smart contract on the blockchain that is signed with each individual dating service provider. This ensures that the owner of the data remains in full control of which platform uses which parts of the data and how that data is used. Validation of the user data can be done by the peers in the blockchain network that have interacted with the user. As there is no trusted authority that can officially validate personal information such as interests or events attended, the peers (in)validate the presented information about the user. A trust system can be used to strengthen the validation.

Distributed Career Validation

Another possible use case is a distributed career validation system. The system should store and verify education, skill, and career information for individuals. The qualification documents are stored in a distributed, secure way and, due to the properties of the blockchain, cannot be changed and will never disappear. The system saves business resources and effort for recruiting and the validation of job applications. The authorities in this case are universities, online schools, and previous employers.

Splitting a document such as a curriculum vitae (CV) into small, easily verifiable pieces of information and completing the missing information using semantic inference can help authorities, such as universities and former employers, easily confirm the validity of the information the candidate has provided. The new employer can then trust that the data provided by the candidate was checked and validated by the respective authority.

Blockchain Democracy

Blockchain-based authentication systems provide a more secure mechanism than conventional identity tools, since they remove intermediaries and, being decentralized, keep records retrievable even in cases of disaster. In order to achieve a successful transition from a centralized government to a decentralized one, the data in all official databases needs to be transferred to the blockchain. Whenever new data is to be added to the blockchain, the smart contract regulates the validation process, as a government official confirms or rejects the truthfulness of the data.

In the case of e-Estonia, citizens can identify themselves in a secure way and every transaction can be approved and stored on the blockchain. The communication between different departments of the government is shortened in time, which makes the institutions more efficient. If a citizen needs a certificate from the government, they identify themselves in the system and send the request to an institution. The employees of the institution (the miners) compete for the task, and the first one to complete it is rewarded inside the blockchain. As soon as the task is done, the result is stored in the system and can be accessed by the citizen.

9.3 Related Work

Zyskind et al. (2015) defined a protocol that turns a blockchain into an automated access-control manager without the need to trust a third party. Their work uses blockchain storage to construct a personal data management platform focused on privacy. However, the protocol does not use Linked Data and has not been implemented. To the best of our knowledge, there has been no follow-up to this work.

Previous work on validating Linked Open Data with blockchains includes several studies at the Open University [5] (Open BlockChain 2018b). Third et al. [96], for instance, compare four approaches to Linked Data/blockchain verification based on triple fragments.

Third & Domingue (2017) have implemented a semantic index of the Ethereum blockchain platform to expose distributed ledger data as Linked Data. Their system indexes both blocks and transactions using the BLONDiE ontology and maps smart contracts to the Minimal Service Model ontology. Their proof of concept is presented as “a first step towards connecting smart contracts with Semantic Web Services”. This paper, like the previous one, focuses on the technological aspects of blockchain and does not describe case studies related to privacy issues on the Web.

Sharples & Domingue [90] propose a permanent distributed record of intellectual effort and associated reputational reward based on the blockchain. In this context, the blockchain is used as a reputation management system, both as a “proof of intellectual work” and as an “intellectual currency”. This proposal, however, concerns only educational records, while ours aims to address a wider variety of private data.

9.4 Conclusion and Discussion

In the present work, we propose a novel approach for validating Linked Data using blockchain technology. We achieve this by constructing a set of rules describing two validation models that can be encoded inside smart contracts. The advantages of using blockchain technology with Linked Data for distributed data validation are: 1) the user maintains full control over their data and how it is used (i.e., no third party stores any personal information); 2) sensitive data is stored in a distributed and secure manner that minimises the risk of data loss or data theft; 3) the data is immutable, and therefore a complete history of the changes can be retrieved at any time; 4) RDF stores can be used for indexing and for searching for specific triples in Linked Data; 5) using Linked Data, information can be enriched with semantic inferences; 6) using smart contracts means that the validation rules on the decentralised system are enforced permanently.

However, the framework presented in this paper has a few limitations: 1) it is vulnerable to all weaknesses that blockchain technology suffers from (e.g., smaller networks are vulnerable to 51% attacks); 2) it requires a certain degree of trust in government organisations to maintain accurate information about the data (i.e., garbage in, garbage out); and 3) in our formalisation we proposed a time-independent smart contract consensus model (where the parameters of the function that produces the final response are fixed). This model suffers from a “time-loss” problem in time-lag cases; it can be further improved by defining time-dependent parameters that ensure a response is obtained within the defined time frames.

Building a decentralized system that uses blockchain technology to support the validation of Linked Data opens up the possibility of secure data storage, control and ownership. It enables trusted, secure, distributed data validation and allows sharing only the explicitly required information with third parties. In future work, we plan to implement the validation and verification workflow described in our approach and to address the limitations mentioned above.

10.1 Related Work

In the following, we present the work related to our approach. First, we describe how a variety of federated SPARQL query engines select the relevant sources in the federation to minimize the execution time. Next, we present approaches addressing data incompleteness when querying Linked Data.

Federated SPARQL query engines [87, 10, 45, 44] are able to evaluate SPARQL queries over a set of data sources. FedX [87] is a federated SPARQL query engine introduced by Schwarte et al. It performs source selection by dynamically sending ASK queries to determine relevant sources and uses bind joins to reduce data transfer during query execution. Anapsid [10] is an adaptive approach for federated SPARQL query processing. It adapts query execution based on the information provided by the sources, e.g., their capabilities or the ontology used to describe the datasets. Anapsid also proposes a set of novel adaptive physical operators for query processing, which are able to quickly produce answers while adapting to network conditions.

Endris et al. [35] improve the performance of federated SPARQL query processing by describing RDF data sources in the form of RDF molecule templates. RDF molecule templates (RDF-MTs) describe the properties associated with entities of the same class available in a remote RDF dataset. RDF-MTs are computed for a dataset accessible via a specific web service. They can be linked within the same dataset or across datasets accessible via other web services. MULDER [35] is a federated SPARQL query engine that leverages these RDF-MTs in order to improve source selection and reduce query execution time while increasing answer completeness. MULDER decomposes a query into star-shaped subqueries and associates them with the RDF-MTs to produce an efficient query execution plan.

Finally, Fedra [73] and Lilac [74] leverage replicated RDF data in the context of federated query processing. They describe RDF datasets using fragments, which indicate which RDF triples can be fetched from which data source. Using this information, they compute a replication-aware source selection and decompose SPARQL queries in order to reduce the redundant data transfers caused by data replication.

However, none of these approaches is able to detect data incompleteness in a federation. Furthermore, the presented source selection approaches cannot overcome semantic heterogeneity to improve answer completeness, as outlined in Section 1.

Acosta et al. [9] propose HARE, a hybrid SPARQL engine which is able to enhance the completeness of query answers using crowdsourcing. It uses a model to estimate the completeness of the RDF dataset. HARE can automatically identify parts of queries that yield incomplete results and retrieves missing values via microtask crowdsourcing. A microtask manager proposes questions to provide specific values to complete the missing results. Thus, HARE relies on the crowd to improve answer completeness and is not able to leverage linked RDF datasets.

We conclude that, to the best of our knowledge, no federated SPARQL query engine is able to tackle the issue of data incompleteness in the presented context.

10.2 Proposed Approach

In our work, we rely on the assumptions that the descriptions of RDF datasets are computed and provided by the data providers, and that linked RDF datasets are correct but potentially incomplete. Our approach is based on three key contributions: (1) an extension of RDF molecule templates to detect data incompleteness, (2) a cost model to determine the relevance of a source, and (3) a physical query operator which leverages the previous contributions to enhance answer completeness during query execution. An overview of the approach is provided in Figure 10.2.

Figure 10.2: Overview of the approach. The figure depicts the query processing model. The engine gets a query as the input. During query execution, the Jedi operator leverages the eRDF-MTs of the data sources in the federation to increase answer completeness. Finally, the complete answers are returned.

10.3 Problem Statement

First, we formalize the problem of data incompleteness and provide the notion of an oracle as a reference point for our definition.

Given a federation of RDF datasets $F = \{G_1, \dots, G_m\}$ and a SPARQL query $Q$ to be evaluated over $F$, i.e., $[[Q]]_F$. Consider $G^*$, the oracle dataset that contains all the data about each entity in the federation. Answer completeness for $[[Q]]_F$, with respect to $G^*$, is defined as

$$completeness(Q, F) = \frac{|[[Q]]_F|}{|[[Q]]_{G^*}|}.$$

The problem of evaluating a complete federated SPARQL query over $F$ is:

$$\underset{F' \subseteq F}{\arg\min}\; |F'| \quad \text{such that} \quad completeness(Q, F') \text{ is maximal}.$$

In other words, the problem is to find the minimal set of sources in $F$ to use during query execution in order to maximize answer completeness.

10.3.1 Extended RDF Molecule template

Next, to tackle the problem of detecting data incompleteness, we rely on the RDF completeness model of HARE [9]. We now introduce the key notions from this model that we are going to use. HARE is able to estimate that the answers to a SPARQL query might be incomplete by leveraging the multiplicity of resources.

Definition 5.

Predicate Multiplicity of an RDF Resource [9]. Given an RDF resource $s$ occurring in the dataset $G$, the multiplicity of the predicate $p$ for the resource $s$, denoted $M_G(s, p)$, is the number of triples in $G$ with subject $s$ and predicate $p$, i.e., $M_G(s, p) = |\{\, (s, p, o) \mid (s, p, o) \in G \,\}|$.

Example 5.

Consider the RDF dataset $G$ from Figure 10.1. The predicate multiplicity of the predicate rdfs:label for the resource dbr:Hair is $M_G(\text{dbr:Hair}, \text{rdfs:label}) = 2$, because the resource is connected to two labels.

Next, using resource multiplicity, HARE computes the aggregated multiplicity for each RDF class in the dataset.

Definition 6.

Aggregated Predicate Multiplicity of a Class [9]. For each class $C$ occurring in the RDF dataset $G$, the aggregated multiplicity of $C$ over the predicate $p$, denoted $AM_G(C, p)$, is: $AM_G(C, p) = f(\{\, M_G(s, p) \mid (s, \texttt{rdf:type}, C) \in G \,\})$, where $(s, \texttt{rdf:type}, C) \in G$ means that the subject $s$ belongs to the class $C$, and $f$ is an aggregation function.

Example 6.

Consider again the RDF dataset $G$ from Figure 10.1, and an aggregation function $f$ that computes the median. The aggregated predicate multiplicity of the class dbo:film over the predicate rdfs:label is then the median of the rdfs:label multiplicities of all film resources in $G$.
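The following sketch computes the two multiplicity notions above with rdflib over a small in-memory graph; the sample triples are invented for illustration and do not come from Figure 10.1.

```python
# Minimal sketch: predicate multiplicity M_G(s, p) and aggregated multiplicity AM_G(C, p),
# computed with rdflib over an invented toy graph (not the dataset of Figure 10.1).
from statistics import median
from rdflib import Graph, Namespace, RDF, RDFS, Literal

DBR = Namespace("http://dbpedia.org/resource/")
DBO = Namespace("http://dbpedia.org/ontology/")

g = Graph()
g.add((DBR.Hair, RDF.type, DBO.Film))
g.add((DBR.Hair, RDFS.label, Literal("Hair")))
g.add((DBR.Hair, RDFS.label, Literal("Hair (film)")))
g.add((DBR.Grease, RDF.type, DBO.Film))
g.add((DBR.Grease, RDFS.label, Literal("Grease")))

def multiplicity(graph, s, p):
    """M_G(s, p): number of triples in the graph with subject s and predicate p."""
    return len(list(graph.triples((s, p, None))))

def aggregated_multiplicity(graph, cls, p, f=median):
    """AM_G(C, p): aggregation f over the multiplicities of all resources of class C."""
    return f(multiplicity(graph, s, p) for s in graph.subjects(RDF.type, cls))

print(multiplicity(g, DBR.Hair, RDFS.label))             # 2
print(aggregated_multiplicity(g, DBO.Film, RDFS.label))  # median of [2, 1] -> 1.5
```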

However, HARE’s completeness model is not designed for a federated scenario, as it can only be computed on a single dataset. To address this issue, we introduce a novel source description, called extended RDF Molecule Template (eRDF-MT), based on the RDF-MTs of [35]. An eRDF-MT, defined in Definition 7, describes each dataset of the federation as the set of properties associated with each RDF class. It also interlinks RDF classes between datasets, in order to find equivalent entities across the federation. Finally, eRDF-MTs also capture the equivalence between properties, in order to handle the semantic heterogeneity of the datasets.

Definition 7.

Extended RDF Molecule Template (eRDF-MT). An Extended RDF Molecule Template is a 7-tuple $\langle W, C, f, DTP, IntraC, InterC, InterP \rangle$ where (a minimal data-structure sketch is given after the list):

  • W is a Web service API that provides access to an RDF dataset G via SPARQL protocol;

  • C is an RDF class such that the triple pattern (?s rdf:type C) is true in G;

  • f is an aggregation function;

  • DTP is a set of 3-tuples (p, T, f(p)) such that p is a property with domain C and range T, and the triple patterns (?s p ?o), (?o rdf:type T) and (?s rdf:type C) are true in G. f(p) is the aggregated multiplicity of predicate p for class C;

  • IntraC is a set of pairs (p, Cj ) such that p is an object property with domain C and range Cj, and the triple patterns (?s p ?o) and (?o rdf:type Cj) and (?s rdf:type C) are true in G;

  • InterC is a set of 3-tuples (p, Ck, SW) such that p is an object property with domain C and range Ck; SW is a Web service API that provides access to an RDF dataset K, and the triple patterns (?s p ?o) and (?s rdf:type C) are true in G, and the triple pattern (?o rdf:type Ck) is true in K.

  • InterP is a set of 3-tuples (p, p’, SW) such that p is a property with domain C and range T, SW is a Web service API that provides access to an RDF dataset K, and p’ is a property with domain C’ and range T’ such that the triple (p owl:sameAs p’) or (p’ owl:sameAs p) exists in G or K.
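The sketch below renders Definition 7 as a simple Python data structure; all field types and the example instantiation are illustrative assumptions rather than part of the formal definition.

```python
# Illustrative data structure for an eRDF-MT (Definition 7); field types are assumptions.
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class ERDFMoleculeTemplate:
    web_service: str                        # W: SPARQL endpoint of the dataset G
    rdf_class: str                          # C: the described RDF class
    aggregation: str                        # f: name of the aggregation function
    dtp: Dict[str, Tuple[str, float]] = field(default_factory=dict)    # p -> (range T, f(p))
    intra_c: List[Tuple[str, str]] = field(default_factory=list)       # (p, Cj) within G
    inter_c: List[Tuple[str, str, str]] = field(default_factory=list)  # (p, Ck, SW) across datasets
    inter_p: List[Tuple[str, str, str]] = field(default_factory=list)  # (p, p', SW) equivalent properties

# An invented example loosely inspired by Figure 10.3 (all values are placeholders).
linkedmdb_film = ERDFMoleculeTemplate(
    web_service="http://data.linkedmdb.org/sparql",
    rdf_class="movie:film",
    aggregation="median",
    dtp={"rdfs:label": ("xsd:string", 1.0)},
    inter_c=[("owl:sameAs", "dbo:Film", "https://dbpedia.org/sparql")],
    inter_p=[("movie:director", "dbo:director", "https://dbpedia.org/sparql")],
)
```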

The idea is to estimate the expected cardinality of each property for each class in the dataset. Thus, if the query engine finds fewer results for an entity of that class and a property than estimated by the eRDF-MT, it considers the results to be incomplete. In this case, we assume that the datasets connected in the eRDF-MT can be used to complete the missing values. Figure 10.3 provides an example of two eRDF-MTs.

Figure 10.3: An example of two interlinked eRDF-MTs for the data sources LinkedMDB (left) and DBpedia (right). InterC and InterP provide links between the classes and properties in the different data sources. Additionally, the aggregated multiplicity of each predicate is displayed next to the predicates.
10.3.2 The Jedi Cost model

We introduce a cost model which relies on the eRDF-MTs to detect RDF datasets that can be used to complete query results and to estimate the relevance of these RDF datasets. This cost model aims to solve our research problem by selecting the minimal number of sources to contact. First, we formalize in Definitions 8 and 9 how to compute the relevant eRDF-MTs that can be used to enhance the results when evaluating a given triple pattern in the federation.

Definition 8.

Relevant eRDF-MTs. Given a triple pattern $tp$, a root eRDF-MT $r$ and a set of eRDF-MTs $M = \{m_1, \dots, m_k\}$, where $m_i$ summarizes the dataset $G_i$, the set of relevant eRDF-MTs for $tp$ and $r$, denoted $R(tp, r) \subseteq M$, contains every $m_i$ such that there exists a class in $m_i$ interlinked with the class $C$ of $r$, and there exists a property in the DTP of $m_i$ interlinked with the predicate of $tp$.

In other words, an eRDF-MT is considered relevant with respect to the root eRDF-MT if it contains the same class (potentially with a different identifier) and if that class has the same predicate (potentially also with a different identifier) as the triple pattern tp.

Definition 9.

Relevance of an eRDF-MT. Given a triple pattern $tp$ and an eRDF-MT $m$, the relevance of $m$ for $tp$ is the aggregated multiplicity $f(p')$ if there exists a 3-tuple $(p', T, f(p'))$ in the DTP of $m$ whose property $p'$ matches (or is interlinked with) the predicate of $tp$.

Using these relevant eRDF-MTs, we next devise a strategy to minimize the number of relevant sources to select by ranking sources according to their relevance, formalized in the following definition.

Definition 10.

Ranking relevant eRDF-MTs. Given a triple pattern $tp$, a root eRDF-MT $r$ and the set of relevant eRDF-MTs $R(tp, r)$, the ranking of $R(tp, r)$ is the sequence $\langle m_1, \dots, m_k \rangle$ in which the eRDF-MTs of $R(tp, r)$ are sorted by descending relevance.

The Jedi operator for Triple Pattern evaluation

Federated SPARQL query engines evaluate SPARQL queries by building a plan of physical query operators [44]. We choose to implement our approach as a physical query operator for triple pattern evaluation, named the Jedi operator, in order to ease its integration into an existing federated SPARQL query engine. Thus, it can be combined with state-of-the-art physical operators, like the Symmetric Hash Join [45] or the Bind Join [87, 51], to handle query execution.

The Jedi operator follows the interlinking between eRDF-MTs in a breadth-first manner to find additional data during query execution. The algorithm of the operator is shown in Figure 10.4. The inputs are a triple pattern, a root eRDF-MT (from which the computation starts) and the set of eRDF-MTs of the data sources in the federation. Starting with the root eRDF-MT, the Jedi operator first evaluates the triple pattern at the associated data source (Lines 1-6). Then, if the results are incomplete according to the aggregated multiplicity, it uses the Jedi cost model to find relevant datasets (Lines 7-8) and selects the most relevant one to continue query execution (Line 14). Next, it performs a triple pattern mapping (Line 12) using the property interlinks of the eRDF-MTs, in order to map the triple pattern to the schema used by the newly selected dataset. The operator terminates when the results are considered complete with respect to the expected aggregated multiplicity, or when there are no more relevant eRDF-MTs that could improve answer completeness.

Figure 10.4: The Jedi operator algorithm evaluates a triple pattern using eRDF-MTs
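Since the original algorithm is only given as Figure 10.4, the following is a minimal Python re-sketch of the control flow described above; the helper functions evaluate(), relevance() and map_triple_pattern(), as well as the attribute names, are hypothetical placeholders rather than the operator's actual interface.

```python
# Hypothetical sketch of the Jedi operator control flow; evaluate(), relevance() and
# map_triple_pattern() are placeholder helpers, not part of the original algorithm.
# evaluate() is assumed to return a set of solution mappings.
def jedi_operator(tp, root_mt, all_mts, evaluate, relevance, map_triple_pattern):
    """Evaluate a triple pattern, falling back to interlinked sources while the
    result cardinality stays below the expected aggregated multiplicity."""
    results = evaluate(tp, root_mt.web_service)        # Lines 1-6: query the root source
    expected = root_mt.dtp[tp.predicate][1]            # expected aggregated multiplicity f(p)
    visited = {root_mt}
    frontier = [root_mt]                               # breadth-first traversal of interlinks

    while len(results) < expected and frontier:
        current = frontier.pop(0)
        # Lines 7-8: rank the relevant, not-yet-visited eRDF-MTs with the cost model
        candidates = sorted(
            (m for m in all_mts if m not in visited and relevance(tp, current, m) > 0),
            key=lambda m: relevance(tp, current, m),
            reverse=True,
        )
        if not candidates:
            break
        best = candidates[0]                               # Line 14: most relevant source
        mapped_tp = map_triple_pattern(tp, current, best)  # Line 12: rewrite to the target schema
        results |= evaluate(mapped_tp, best.web_service)
        visited.add(best)
        frontier.append(best)

    return results
```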

10.4 Evaluation and Results

In the evaluation, we consider five queries evaluated over two datasets in order to determine the impact of our approach on answer completeness. Each query is associated with a certain domain in order to show that completeness issues are distributed over different parts of the data. The federation contains the datasets DBpedia and Wikidata, and we treat Wikidata as a mirror dataset of DBpedia. This means that, according to our cost model, Wikidata is queried only if the results from DBpedia are estimated to be incomplete. The original queries and the rewritten queries are provided in Appendix A of this work.

For the sake of brevity, we discuss the evaluation query q1 in the following. In this query, we want to determine the position, date of birth and team of soccer players. When evaluating the query over DBpedia alone, we retrieve no results; the results are therefore incomplete once Wikidata is taken into account as well. Rewriting the query according to our proposed approach and executing it over the federation of both datasets, we obtain 42 results. As shown in Table 10.1, similar results can be observed for the other queries. The results of this first evaluation clearly indicate the potential of our approach to increase answer completeness over a federation of datasets. We expect similar results in other domains and for other datasets as well.

Domain        | Query | DBpedia | DBpedia + Wikidata
Sport         | q1    | 0       | 42
Movies        | q2    | 3       | 6
Culture       | q3    | 0       | 31
Drugs         | q4    | 0       | 482
Life Sciences | q5    | 0       | 9
Table 10.1: Results of our preliminary evaluation. The table shows the number of answers for the 5 queries evaluated over the dataset DBpedia and for the corresponding rewritten queries evaluated over the federation of DBpedia and Wikidata.
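To make the evaluation setup more concrete, the sketch below queries DBpedia first and falls back to Wikidata when fewer answers than expected are returned. The query shown is a simplified illustration, not the actual q1 from Appendix A, and the Wikidata identifiers (P106, P54, P569, Q937857) as well as the expected cardinality are assumptions.

```python
# Simplified illustration of the evaluation setup: query DBpedia first and fall back to
# Wikidata when fewer answers than expected are found. Not the actual q1 from Appendix A.
from SPARQLWrapper import SPARQLWrapper, JSON

DBPEDIA_Q = """
SELECT ?player ?team WHERE {
  ?player a dbo:SoccerPlayer ;
          dbo:team ?team ;
          dbo:birthDate ?birth .
} LIMIT 10
"""

WIKIDATA_Q = """
SELECT ?player ?team WHERE {
  ?player wdt:P106 wd:Q937857 ;   # occupation: association football player (assumed IDs)
          wdt:P54  ?team ;        # member of sports team
          wdt:P569 ?birth .       # date of birth
} LIMIT 10
"""

def run(endpoint: str, query: str):
    client = SPARQLWrapper(endpoint)
    client.setQuery(query)
    client.setReturnFormat(JSON)
    return client.query().convert()["results"]["bindings"]

EXPECTED = 10  # expected cardinality, e.g. derived from an eRDF-MT (placeholder value)

answers = run("https://dbpedia.org/sparql", DBPEDIA_Q)
if len(answers) < EXPECTED:
    # DBpedia results are estimated to be incomplete: also query the mirror dataset
    answers += run("https://query.wikidata.org/sparql", WIKIDATA_Q)

print(len(answers), "answers in total")
```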

10.5 Conclusion and Discussion

In this paper, we proposed Jedi, a new adaptive approach for federated SPARQL query processing, which is able to estimate data incompleteness and uses links between classes and properties in different RDF datasets to improve answer completeness. It relies on extended RDF Molecule Templates, which describe the classes and properties as well as the links between data sources. Furthermore, by including the aggregated predicate multiplicity of entities, they allow for detecting incompleteness during query execution. Using these eRDF-MTs and a cost model, the Jedi operator is able to discover new data sources to improve answer completeness.

The results of our evaluation show that answer incompleteness is present in various domains of the well-known dataset DBpedia. Furthermore, we show that rewriting queries according to the presented approach increases the completeness of the results.

Our approach suffers from one main limitation: it assumes that eRDF-MTs are pre-computed and published by data providers. We also suppose that data providers are aware of the interlinking between their datasets. One perspective is to research how these eRDF-MTs can be computed by data consumers instead, in order to reduce the dependence on data providers.

In the future, we also aim to integrate the Jedi operator into a state-of-the-art federated SPARQL query engine, like FedX [87], MULDER [35] or Anapsid [10], in order to conduct a more elaborate experimental study of our approach. Based on this study, we will then improve our approach to maximize answer completeness.

Bibliography

  • [1] Stanford 40 Actions. http://vision.stanford.edu/Datasets/40actions.html, [Online; accessed 6-July-2018]
  • [2] UCF101: a Dataset of 101 Human Actions Classes From Videos in The Wild. http://crcv.ucf.edu/data/UCF101.php, [Online; accessed 19-July-2018]
  • [3] Validity. (2018). In OxfordDictionaries.com. https://en.oxforddictionaries.com/definition/validity, [July 6, 2018]
  • [4] Open BlockChain (2018a) Decentralizing the Semantic Web via BlockChains. https://blockchain7.kmi.open.ac.uk/rdf/ (2018)
  • [5] Open BlockChain. (2018b). Researching the Potential of BlockChains. http://blockchain.open.ac.uk (2018)
  • [6] Technopedia. Data Validation. https://www.techopedia.com/definition/10283/data-validation (2018), [July 6, 2018]
  • [7] The Week (2018). Who are the Windrush generation and how did the scandal unfold? http://www.theweek.co.uk/92944/who-are-the-windrush-generation-and-why-are-they-facing-deportation (2018), [June 18, 2018]
  • [8] TopQuadrant, Inc. (2018). From OWL to SHACL in an automated way - TopQuadrant, Inc. https://www.topquadrant.com/2018/05/01/from-owl-to-shacl-in-an-automated-way/ (2018), [Online; accessed 6-July-2018]
  • [9] Acosta, M., Simperl, E., Flöck, F., Vidal, M.E.: Enhancing answer completeness of sparql queries via crowdsourcing. Web Semantics: Science, Services and Agents on the World Wide Web 45, 41–62 (2017)
  • [10] Acosta, M., Vidal, M.E., Lampo, T., Castillo, J., Ruckhaus, E.: Anapsid: an adaptive query processing engine for sparql endpoints. In: International Semantic Web Conference. pp. 18–34. Springer (2011)
  • [11] Akman, V., Surav, M.: Steps toward formalizing context. AI magazine 17(3),  55 (1996)
  • [12] Asprino, L., Basile, V., Ciancarini, P., Presutti, V.: Empirical analysis of foundational distinctions in the web of data. CoRR abs/1803.09840 (2018), http://arxiv.org/abs/1803.09840
  • [13] Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R., Ives, Z.: Dbpedia: A nucleus for a web of open data. In: The semantic web, pp. 722–735. Springer (2007)
  • [14] Bayoudhi, L., Sassi, N., Jaziri, W.: How to repair inconsistency in owl 2 dl ontology versions? Data & Knowledge Engineering (2018)
  • [15] Belkin, M., Niyogi, P.: Laplacian eigenmaps and spectral techniques for embedding and clustering. In: Advances in neural information processing systems. pp. 585–591 (2002)
  • [16] Belleau, F., Nolin, M.A., Tourigny, N., Rigault, P., Morissette, J.: Bio2rdf: towards a mashup to build bioinformatics knowledge systems. Journal of biomedical informatics 41(5), 706–716 (2008)
  • [17] Berners-Lee, T.: Linked Data. 2006. http://www.w3.org/DesignIssues/LinkedData.html, [July 6, 2018]
  • [18] Bhatia, S., Dwivedi, P., Kaur, A.: Tell me why is it so? explaining knowledge graph relationships by finding descriptive support passages. arXiv preprint arXiv:1803.06555 (2018)
  • [19] Bhatia, S., Vishwakarma, H.: Know thy neighbors, and more! studying the role of context in entity recommendation (2018)
  • [20] Bird, S., Klein, E., Loper, E.: Natural language processing with Python: analyzing text with the natural language toolkit. ” O’Reilly Media, Inc.” (2009)
  • [21] Bizer, C., Cyganiak, R.: Quality-driven information filtering using the wiqa policy framework. Web Semantics: Science, Services and Agents on the World Wide Web 7(1), 1–10 (2009)
  • [22] Blei, D.M.: Probabilistic topic models. Communications of the ACM 55(4), 77–84 (2012)
  • [23] Bozzato, L., Homola, M., Serafini, L.: Context on the semantic web: Why and how. ARCOE-12 p. 11 (2012)
  • [24] Bühmann, L., Lehmann, J., Westphal, P.: Dl-learner - A framework for inductive learning on the semantic web. J. Web Semant. 39, 15–24 (2016)
  • [25] Cai, H., Zheng, V.W., Chang, K.: A comprehensive survey of graph embedding: problems, techniques and applications. IEEE Transactions on Knowledge and Data Engineering (2018)
  • [26] Ceolin, D., Maccatrozzo, V., Aroyo, L., De-Nies, T.: Linking trust to data quality. In: 4th International Workshop on Methods for Establishing Trust of (Open) Data (2015)
  • [27] Ceolin, D., Van Hage, W.R., Fokkink, W., Schreiber, G.: Estimating uncertainty of categorical web data. In: URSW. pp. 15–26. Citeseer (2011)
  • [28] Cochez, M., Ristoski, P., Ponzetto, S.P., Paulheim, H.: Global RDF vector space embeddings. In: International Semantic Web Conference (1). Lecture Notes in Computer Science, vol. 10587, pp. 190–207. Springer (2017)
  • [29] Couto, R., Ribeiro, A.N., Campos, J.C.: Application of ontologies in identifying requirements patterns in use cases. arXiv preprint arXiv:1404.0850 (2014)
  • [30] Daiber, J., Jakob, M., Hokamp, C., Mendes, P.N.: Improving efficiency and accuracy in multilingual entity extraction. In: Proceedings of the 9th International Conference on Semantic Systems. pp. 121–124. ACM (2013)
  • [31] d’Amato, C., Staab, S., Tettamanzi, A.G., Minh, T.D., Gandon, F.: Ontology enrichment by discovering multi-relational association rules from ontological knowledge bases. In: Proceedings of the 31st Annual ACM Symposium on Applied Computing. pp. 333–338. ACM (2016)
  • [32] Dennis, M., Van Deemter, K., Dell’Aglio, D., Pan, J.Z.: Computing authoring tests from competency questions: Experimental validation. In: International Semantic Web Conference. pp. 243–259. Springer (2017)
  • [33] Dong, X., Gabrilovich, E., Heitz, G., Horn, W., Lao, N., Murphy, K., Strohmann, T., Sun, S., Zhang, W.: Knowledge vault: A web-scale approach to probabilistic knowledge fusion. In: Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. pp. 601–610. ACM (2014)
  • [34] d’Amato, C., Tettamanzi, A.G., Minh, T.D.: Evolutionary discovery of multi-relational association rules from ontological knowledge bases. In: European Knowledge Acquisition Workshop. pp. 113–128. Springer (2016)
  • [35] Endris, K.M., Galkin, M., Lytra, I., Mami, M.N., Vidal, M.E., Auer, S.: Mulder: querying the linked data web by bridging rdf molecule templates. In: International Conference on Database and Expert Systems Applications. pp. 3–18. Springer (2017)
  • [36] van Erp, M., Hensel, R., Ceolin, D., van der Meij, M.: Georeferencing animal specimen datasets. Transactions in GIS 19(4), 563–581 (2015)
  • [37] Fanizzi, N., d’Amato, C., Esposito, F.: DL-FOIL concept learning in description logics. In: ILP. Lecture Notes in Computer Science, vol. 5194, pp. 107–121. Springer (2008)
  • [38] Fischer, P.M., Lausen, G., Schätzle, A., Schmidt, M.: RDF constraint checking. In: EDBT/ICDT Workshops. CEUR Workshop Proceedings, vol. 1330, pp. 205–212. CEUR-WS.org (2015)
  • [39] Fouss, F., Pirotte, A., Renders, J.M., Saerens, M.: Random-walk computation of similarities between nodes of a graph with application to collaborative recommendation. IEEE Transactions on knowledge and data engineering 19(3), 355–369 (2007)
  • [40] Galárraga, L., Teflioudi, C., Hose, K., Suchanek, F.M.: Fast rule mining in ontological knowledge bases with amie $$+ $$+. The VLDB Journal—The International Journal on Very Large Data Bases 24(6), 707–730 (2015)
  • [41] Galárraga, L.A., Teflioudi, C., Hose, K., Suchanek, F.: Amie: association rule mining under incomplete evidence in ontological knowledge bases. In: Proceedings of the 22nd international conference on World Wide Web. pp. 413–422. ACM (2013)
  • [42] Gangemi, A., Guarino, N., Masolo, C., Oltramari, A.: Sweetening wordnet with dolce. AI Mag. 24(3), 13–24 (Sep 2003), http://dl.acm.org/citation.cfm?id=958671.958673
  • [43] Gayo, J.E.L.: Linked data validation and quality
  • [44] Görlitz, O., Staab, S.: Federated data management and query optimization for linked open data. In: New Directions in Web Data Management 1, pp. 109–137. Springer (2011)
  • [45] Görlitz, O., Staab, S.: Splendid: Sparql endpoint federation exploiting void descriptions. In: Proceedings of the Second International Conference on Consuming Linked Data-Volume 782. pp. 13–24. CEUR-WS. org (2011)
  • [46] Group, W.O.W., et al.: OWL 2 web ontology language document overview (2009)
  • [47] Grover, A., Leskovec, J.: node2vec: Scalable feature learning for networks. In: Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining. pp. 855–864. ACM (2016)
  • [48] Grover, C., Tobin, R., Byrne, K., Woollard, M., Reid, J., Dunn, S., Ball, J.: Use of the edinburgh geoparser for georeferencing digitized historical collections. Philosophical Transactions of the Royal Society of London A: Mathematical, Physical and Engineering Sciences 368(1925), 3875–3889 (2010)
  • [49] Grüninger, M., Fox, M.S.: The role of competency questions in enterprise engineering. In: Benchmarking—Theory and practice, pp. 22–31. Springer (1995)
  • [50] Guha, R., McCool, R., Fikes, R.: Contexts for the semantic web. In: International Semantic Web Conference. pp. 32–46. Springer (2004)
  • [51] Haas, L., Kossmann, D., Wimmers, E., Yang, J.: Optimizing queries across diverse data sources (1997)
  • [52] Hartig, O., Zhao, J.: Publishing and consuming provenance metadata on the web of linked data. In: International Provenance and Annotation Workshop. pp. 78–90. Springer (2010)
  • [53] Hofer, P., Neururer, S., Helga Hauffe, T., Zeilner, A., Göbel, G.: Semi-automated evaluation of biomedical ontologies for the biobanking domain based on competency questions. Studies in Health Tech. and Informatics 212, 65–72 (2015)
  • [54] Homola, M., Serafini, L., Tamilin, A.: Modeling contextualized knowledge. In: Procs. of the 2nd Workshop on Context, Information and Ontologies (CIAO 2010). vol. 626 (2010)
  • [55] Jacobson, I.: Object-oriented software engineering: a use case driven approach. Pearson Education India (1993)
  • [56] Jansen, B.: Context: A real problem for large and shareable knowledge bases. Building/Sharing Very Large Knowledge Bases (KBKS’93), Tokyo (1993)
  • [57] Khosrow-Pour, M.: Encyclopedia of information science and technology. IGI Global (2005)
  • [58] Khriyenko, O., Terziyan, V.: A framework for context-sensitive metadata description. International Journal of Metadata, Semantics and Ontologies 1(2), 154–164 (2006)
  • [59] Kontokostas, D., Westphal, P., Auer, S., Hellmann, S., Lehmann, J., Cornelissen, R., Zaveri, A.: Test-driven evaluation of linked data quality. In: Proceedings of the 23rd international conference on World Wide Web. pp. 747–758. ACM (2014)
  • [60] Kunze, L., Tenorth, M., Beetz, M.: Putting people’s common sense into knowledge bases of household robots. In: Dillmann, R., Beyerer, J., Hanebeck, U.D., Schultz, T. (eds.) KI 2010: Advances in Artificial Intelligence. pp. 151–159. Springer Berlin Heidelberg, Berlin, Heidelberg (2010)
  • [61] Lalithsena, S., Kapanipathi, P., Sheth, A.: Harnessing relationships for domain-specific subgraph extraction: A recommendation use case. In: Big Data (Big Data), 2016 IEEE International Conference on. pp. 706–715. IEEE (2016)
  • [62] Lalithsena, S., Perera, S., Kapanipathi, P., Sheth, A.: Domain-specific hierarchical subgraph extraction: A recommendation use case. In: Big Data (Big Data), 2017 IEEE International Conference on. pp. 666–675. IEEE (2017)
  • [63] Lehmann, J.: Dl-learner: Learning concepts in description logics. Journal of Machine Learning Research 10, 2639–2642 (2009)
  • [64] Lehmann, J., Bühmann, L.: Ore-a tool for repairing and enriching knowledge bases. In: International Semantic Web Conference. pp. 177–193. Springer (2010)
  • [65] Lehmann, J., Isele, R., Jakob, M., Jentzsch, A., Kontokostas, D., Mendes, P.N., Hellmann, S., Morsey, M., Van Kleef, P., Auer, S., et al.: Dbpedia–a large-scale, multilingual knowledge base extracted from wikipedia. Semantic Web 6(2), 167–195 (2015)
  • [66] Lewin, T.: Dean at M.I. T. Resigns, Ending a 28-Year Lie. https://www.nytimes.com/2007/04/27/us/27mit.html (2007), [July 6, 2018]
  • [67] Marrero, M., Urbano, J., Sánchez-Cuadrado, S., Morato, J., Gómez-Berbís, J.M.: Named entity recognition: fallacies, challenges and opportunities. Computer Standards & Interfaces 35(5), 482–489 (2013)
  • [68] McCallum, A.: Information extraction: Distilling structured data from unstructured text. Queue 3(9),  4 (2005)
  • [69] Mendes, P.N., Jakob, M., García-Silva, A., Bizer, C.: Dbpedia spotlight: shedding light on the web of documents. In: Proceedings of the 7th international conference on semantic systems. pp. 1–8. ACM (2011)
  • [70] Mifflin, H.: The american heritage dictionary of the english language. New York (2000)
  • [71] Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. CoRR abs/1301.3781 (2013)
  • [72] Missier, P., Belhajjame, K., Cheney, J.: The w3c prov family of specifications for modelling provenance metadata. In: Proceedings of the 16th International Conference on Extending Database Technology. pp. 773–776. ACM (2013)
  • [73] Montoya, G., Skaf-Molli, H., Molli, P., Vidal, M.E.: Federated sparql queries processing with replicated fragments. In: International Semantic Web Conference. pp. 36–51. Springer (2015)
  • [74] Montoya, G., Skaf-Molli, H., Molli, P., Vidal, M.E.: Decomposing federated queries in presence of replicated fragments. Web Semantics: Science, Services and Agents on the World Wide Web 42, 1–18 (2017)
  • [75] Nentwig, M., Hartung, M., Ngonga Ngomo, A.C., Rahm, E.: A survey of current link discovery frameworks. Semantic Web 8(3), 419–436 (2017)
  • [76] Patel-Schneider, P.F.: Using description logics for rdf constraint checking and closed-world recognition. In: AAAI. pp. 247–253 (2015)
  • [77] Pepitone, J.: Yahoo confirms CEO is out after resume scandal. http://money.cnn.com/2012/05/13/technology/yahoo-ceo-out/index.htm (2012), [July 6, 2018]
  • [78] Perozzi, B., Akoglu, L., Iglesias Sánchez, P., Müller, E.: Focused clustering and outlier detection in large attributed graphs. In: Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. pp. 1346–1355. ACM (2014)

  • [79] Pilkington, M.: Blockchain technology: principles and applications. research handbook on digital transformations, edited by f. xavier olleros and majlinda zhegu (2016)
  • [80] Ren, Y., Parvizi, A., Mellish, C., Pan, J.Z., Van Deemter, K., Stevens, R.: Towards competency question-driven ontology authoring. In: European Semantic Web Conference. pp. 752–767. Springer (2014)
  • [81] Ristoski, P., Paulheim, H.: Rdf2vec: RDF graph embeddings for data mining. In: The Semantic Web - ISWC 2016 - 15th International Semantic Web Conference, Kobe, Japan, October 17-21, 2016, Proceedings, Part I. pp. 498–514 (2016). https://doi.org/10.1007/978-3-319-46523-4_30, https://doi.org/10.1007/978-3-319-46523-4_30
  • [82] Rizzo, G., d’Amato, C., Fanizzi, N., Esposito, F.: Tree-based models for inductive classification on the web of data. J. Web Semant. 45, 1–22 (2017)
  • [83] Rocha, O.R., Vagliano, I., Figueroa, C., Cairo, F., Futia, G., Licciardi, C.A., Marengo, M., Morando, F.: Semantic annotation and classification in practice. IT Professional (2), 33–39 (2015)
  • [84] Rosenberg, M., Confessore, N., Cadwalladr, C.: How Trump Consultants Exploited the Facebook Data of Millions. The New York Times. https://www.nytimes.com/2018/03/17/us/politics/cambridge-analytica-trump-campaign.html (2018), [March 17, 2018]
  • [85] Rula, A., Zaveri, A.: Methodology for assessment of linked data quality. In: LDQ@ SEMANTICS (2014)
  • [86] Schmachtenberg, M., Bizer, C., Paulheim, H.: Adoption of the linked data best practices in different topical domains. In: International Semantic Web Conference. pp. 245–260. Springer (2014)
  • [87] Schwarte, A., Haase, P., Hose, K., Schenkel, R., Schmidt, M.: Fedx: Optimization techniques for federated query processing on linked data. In: International Semantic Web Conference. pp. 601–616. Springer (2011)
  • [88] Serafini, L., Homola, M.: Contextual representation and reasoning with description logics. In: 24th International Workshop on Description Logics. p. 378 (2011)
  • [89] Serafini, L., Homola, M.: Contextualized knowledge repositories for the semantic web. Web Semantics: Science, Services and Agents on the World Wide Web 12, 64–87 (2012)
  • [90] Sharples, M., Domingue, J.: The blockchain and kudos: A distributed system for educational record, reputation and reward. In: European Conference on Technology Enhanced Learning. pp. 490–496. Springer (2016)
  • [91] Shen, W., Wang, J., Luo, P., Wang, M.: Linden: linking named entities with knowledge base via semantic knowledge. In: Proceedings of the 21st international conference on World Wide Web. pp. 449–458. ACM (2012)
  • [92] Silva, V.S., Freitas, A., Handschuh, S.: Word tagging with foundational ontology classes: Extending the wordnet-dolce mapping to verbs. In: 20th International Conference on Knowledge Engineering and Knowledge Management - Volume 10024. pp. 593–605. EKAW 2016, Springer-Verlag New York, Inc., New York, NY, USA (2016), https://doi.org/10.1007/978-3-319-49004-5_38
  • [93] Sporny, M., D.L.: Verifiable Claims Data Model and Representation. W3C First Public Working Draft 03 August 2017. https://www.w3.org/TR/verifiable-claims-data-model/ (2018), [July 4, 2018]
  • [94] Suchanek, F.M., Kasneci, G., Weikum, G.: Yago: A large ontology from wikipedia and wordnet. Web Semantics: Science, Services and Agents on the World Wide Web 6(3), 203–217 (2008)
  • [95] Tao, J., Sirin, E., Bao, J., McGuinness, D.L.: Integrity constraints in owl. In: Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence. pp. 1443–1448. AAAI’10, AAAI Press (2010), http://dl.acm.org/citation.cfm?id=2898607.2898837
  • [96] Third, A., Domingue, J.: Linkchains: Exploring the space of decentralised trustworthy linked data (2017)
  • [97] Today., I.: Narendra Modi degree row: DU college says it has no data of students passing out in 1978. https://www.indiatoday.in/india/story/narendra-modi-degree-controversy-delhi-university-rti-965536-2017-03-14 (2017), [July 6, 2018]
  • [98] Trinh, T.H., Le, Q.V.: A simple method for commonsense reasoning. CoRR abs/1806.02847 (2018), http://arxiv.org/abs/1806.02847
  • [99] Udrea, O., Recupero, D.R., Subrahmanian, V.: Annotated rdf. ACM Transactions on Computational Logic (TOCL) 11(2),  10 (2010)
  • [100] Vasardani, M., Winter, S., Richter, K.F.: Locating place names from place descriptions. International Journal of Geographical Information Science 27(12), 2509–2532 (2013)
  • [101] Vrandečić, D., Krötzsch, M.: Wikidata: a free collaborative knowledgebase. Communications of the ACM 57(10), 78–85 (2014)
  • [102] Wallach, H.M.: Topic modeling: beyond bag-of-words. In: Proceedings of the 23rd international conference on Machine learning. pp. 977–984. ACM (2006)
  • [103] Yao, L., Zhang, Y., Wei, B., Jin, Z., Zhang, R., Zhang, Y., Chen, Q.: Incorporating knowledge graph embeddings into topic modeling. In: AAAI. pp. 3119–3126 (2017)
  • [104] Zaveri, A., Rula, A., Maurino, A., Pietrobon, R., Lehmann, J., Auer, S.: Quality assessment for linked data: A survey. Semantic Web 7(1), 63–93 (2016)