Rule Applicability on RDF Triplestore Schemas

07/02/2019 ∙ by Paolo Pareti, et al.

Rule-based systems play a critical role in health and safety, where policies created by experts are usually formalised as rules. When dealing with increasingly large and dynamic sources of data, as in the case of Internet of Things (IoT) applications, it becomes important not only to efficiently apply rules, but also to reason about their applicability on datasets confined by a certain schema. In this paper we define the notion of a triplestore schema which models a set of RDF graphs. Given a set of rules and such a schema as input we propose a method to determine rule applicability and produce output schemas. Output schemas model the graphs that would be obtained by running the rules on the graph models of the input schema. We present two approaches: one based on computing a canonical (critical) instance of the schema, and a novel approach based on query rewriting. We provide theoretical, complexity and evaluation results that show the superior efficiency of our rewriting approach.

1 Introduction

Inference rules are a common tool in many areas where they are used, for example, to model access control policies [3] and business rules [10]. In this paper we are motivated by their use in Internet of Things (IoT) applications, where rules are often used to capture human decision making in a simple and straightforward way [21]. This is especially true in safety-critical domains, such as in the field of Occupational Health and Safety (OHS). OHS knowledge is codified by experts into policies, which are then translated into rules to monitor and regulate workplaces. For example, OHS regulations set limits on human exposure to certain gases. Monitoring systems can use these rules to determine, from sensor data, whether dangerous gas concentration levels have been reached, and trigger warnings or perform actions such as increasing ventilation. Real-world use cases have been provided by industrial partners, such as a wireless and electronic solutions company, and OHS policies from the International Labour Organisation [16].

An important limitation of current inference rule applications in the IoT domain is that they require expert human interventions not only to create rules, but also to manage them. This includes determining when they are applicable and what types of facts we can ultimately infer. In the gas-concentration example above, human experts would be employed to answer questions such as: could the rule that aggregates sensor data be used, maybe in conjunction with other rules, to determine whether an area should be evacuated? Is this rule applicable to the available data sources? And will this rule still be applicable after the underlying data sources change (e.g., in case sensors stop working or are substituted with others)? Knowing which rules can be applied and what type of facts can be inferred on a dataset can have safety critical implications, as OHS policies might depend on the availability of certain pieces of information. It is important to note that by executing rules on a specific dataset we only discover the facts that are currently entailed. To predict which facts could potentially be inferred in future versions of the dataset, we need to reason about its schema.

As IoT scenarios become increasingly complex and dynamic, managing rules in a timely and cost effective way requires improvements in automation. In this paper we present an approach that can answer these questions automatically, by reasoning about an abstraction of the available data sources, called the triplestore schema. We define triplestore schemas as abstract signatures of underlying data, similar to database schemas. Triplestore schemas can be defined by experts or, as we will see later, can be derived from the types of sensor available. Such schemas are dynamic, changing as the data sources change; e.g., a new sensor is added to the system creating triples with new predicates, and entailing other facts.

We consider RDF [8] triplestores and we model triplestore schemas as sets of SPARQL [13] triple patterns, which in some formal sense restrict or model the underlying RDF data. We express rules as SPARQL construct queries. This type of rule models SPIN [17] inference rules, which correspond to Datalog rules [6], and is also compatible with the monotonic subsets of other rule languages, such as SWRL [15]. Given an input triplestore schema and a set of rules, our objective is to decide whether the rules would apply on some RDF dataset modelled by this schema. We do so by computing the “output” or “consequence” schema of these hypothetical rule applications: this is the schema that models all possible RDF datasets that can be obtained by executing the rules on all datasets modelled by the input schema. It is worth noting that our approach is only concerned with schemas, and it is compatible with relational datasets, as long as their schema and rules can be expressed using RDF and SPARQL [5].

Reasoning at the schema level has been explored previously for databases [20] and Description Logics [11]. In fact, for a different but closely related problem of reasoning on database schemas (called chase termination), Marnette [20] employed a canonical database instance, called the critical instance, which is representative of all database instances of the given schema, on which we base one of our solutions.

We propose two approaches to reason about the applicability of inference rules on triplestore schemas. First, we re-use the critical instance for our triplestore schemas and develop an approach based on this representative RDF graph: running the rules on this graph produces evaluation mappings which, after careful manipulation in order to account for peculiarities of RDF literals, help produce our consequence schemas. When constructing the critical instance, as in the original case of relational databases, we need to place all constants appearing in our schema and rules in the constructed instance in many possible ways. This leads to a blowup in its size and so we turn our attention to finding a much smaller representative RDF graph, that we call the sandbox graph, and which we populate with only one “representative” element. We then develop a novel query rewriting algorithm that can compute the consequence schema on the sandbox graph. We provide correctness, complexity, and evaluation results and experimentally exhibit the efficiency and scalability of our rewriting-based approach: it surpasses the critical-instance methods by orders of magnitude while scaling to hundreds of rules and schema triples in times ranging from milliseconds to seconds.

2 Background

We consider triplestores containing a single RDF graph, without blank nodes. Such a graph is a set of triples in $\mathbb{U} \times \mathbb{U} \times (\mathbb{U} \cup \mathbb{L})$, where $\mathbb{U}$ is the set of all URIs and $\mathbb{L}$ the set of all literals; we write $\mathbb{V}$ for the set of all variables. We use the term constants to refer to both literals and URIs. A graph pattern is a set of triple patterns defined in $(\mathbb{U} \cup \mathbb{V}) \times (\mathbb{U} \cup \mathbb{V}) \times (\mathbb{U} \cup \mathbb{L} \cup \mathbb{V})$. Given a pattern $P$, $\mathit{vars}(P)$ and $\mathit{const}(P)$ are the sets of variables and constants in the elements of $P$, respectively. We represent URIs as namespace-prefixed strings of characters, where a namespace prefix is a sequence of zero or more characters followed by a colon, e.g. :a; literals as strings of characters enclosed in double quotes, e.g. “l”; and variables as strings of characters prefixed by a question mark, e.g. ?v. The first, second and third elements of a triple $t$ are called, respectively, subject, predicate and object, and are denoted by $t[x]$, with $x \in \tau$ denoting throughout the paper indexes in $\tau = \{1, 2, 3\}$.

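A variable substitution is a partial function $\mathbb{V} \to \mathbb{V} \cup \mathbb{U} \cup \mathbb{L}$. A mapping is a variable substitution defined as $\mathbb{V} \to \mathbb{U} \cup \mathbb{L}$. Given a mapping $m$, if $m(?v) = n$, then we say $m$ contains binding $?v \to n$. The domain of a mapping $m$ is the set of variables $\mathit{dom}(m)$. Given a triple or a graph pattern $p$ and a variable substitution $m$ we abuse notation and denote by $m(p)$ the pattern generated by substituting every occurrence of a variable $?v$ in $p$ with $m(?v)$ if $?v \in \mathit{dom}(m)$ (otherwise $?v$ remains unchanged in $m(p)$).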

Given a graph pattern $P$ and a graph $I$, the SPARQL evaluation of $P$ over $I$, denoted with $[\![P]\!]_I$, is a set of mappings as defined in [22]. A graph pattern matches a graph if its evaluation on the graph returns a non-empty set of mappings. We consider inference rules $A \to C$, where $A$ and $C$ are graph patterns, and can be expressed as SPARQL construct queries. Note that essentially both $A$ and $C$ in a rule are conjunctive queries [1]. The consequent $C$ of the rule is represented in the construct clause of the query, which is instantiated using the bindings obtained by evaluating the antecedent $A$, expressed in the where clause. A single application of a rule $r: A \to C$ to a dataset $I$, denoted by $r(I)$, is $I \cup \bigcup \{ m(C) \mid m \in [\![A]\!]_I \text{ and } m(C) \text{ is a valid RDF graph} \}$. Rule notations such as SPIN and SWRL can be represented in this format [2]. The closure, or saturation, of a dataset $I$ under a set of inference rules $R$, denoted by $\mathit{clos}(I, R)$, is the unique dataset obtained by repeatedly applying all the rules in $R$ until no new statement is inferred, that is, $\mathit{clos}(I, R) = \bigcup_{i \ge 0} I_i$, with $I_0 = I$, and $I_{i+1} = \bigcup_{r \in R} r(I_i)$.
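The definitions of pattern evaluation, single rule application and closure can be sketched in a few lines of Python. This is a minimal illustration, not the paper's Java implementation: it assumes triples are 3-tuples of strings, that variables start with `?` and literals with a double quote, and it evaluates only conjunctive patterns. The function names, the sample graph and the two sample rules are hypothetical and chosen only to mirror the gas-concentration scenario.

```python
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]
Mapping = Dict[str, str]
Rule = Tuple[List[Triple], List[Triple]]  # (antecedent, consequent)


def is_var(term: str) -> bool:
    return term.startswith("?")


def is_literal(term: str) -> bool:
    return term.startswith('"')


def evaluate(pattern: List[Triple], graph: Set[Triple]) -> List[Mapping]:
    """Return every mapping m such that m(pattern) is contained in graph."""
    mappings: List[Mapping] = [{}]
    for tp in pattern:
        extended: List[Mapping] = []
        for m in mappings:
            for triple in graph:
                candidate = dict(m)
                # A variable must keep the value it was first bound to;
                # a constant must match the graph term exactly.
                if all(
                    candidate.setdefault(p, g) == g if is_var(p) else p == g
                    for p, g in zip(tp, triple)
                ):
                    extended.append(candidate)
        mappings = extended
    return mappings


def is_valid_rdf(triples: List[Triple]) -> bool:
    """Ground triples only, with no literal as subject or predicate."""
    return all(
        not any(is_var(t) for t in tp)
        and not is_literal(tp[0])
        and not is_literal(tp[1])
        for tp in triples
    )


def apply_rule(rule: Rule, graph: Set[Triple]) -> Set[Triple]:
    """Single application r(I): add m(C) for each mapping m of the antecedent."""
    antecedent, consequent = rule
    result = set(graph)
    for m in evaluate(antecedent, graph):
        instantiated = [tuple(m.get(t, t) for t in tp) for tp in consequent]
        if is_valid_rdf(instantiated):
            result.update(instantiated)
    return result


def closure(graph: Set[Triple], rules: List[Rule]) -> Set[Triple]:
    """Apply all rules repeatedly until no new triple is inferred."""
    current = set(graph)
    while True:
        nxt = set(current)
        for rule in rules:
            nxt |= apply_rule(rule, current)
        if nxt == current:
            return current
        current = nxt


# Hypothetical sample data: two observations made in the same tunnel.
GRAPH: Set[Triple] = {
    (":o1", "sosa:observedProperty", ":CO_Danger"),
    (":o1", "sosa:hasFeatureOfInterest", ":TunnelA"),
    (":o1", "sosa:hasResult", '"1"'),
    (":o2", "sosa:observedProperty", ":WorkerTag"),
    (":o2", "sosa:hasFeatureOfInterest", ":TunnelA"),
    (":o2", "sosa:hasResult", ":John"),
}
OFF_LIMIT: Rule = (
    [("?o", "sosa:observedProperty", ":CO_Danger"),
     ("?o", "sosa:hasFeatureOfInterest", "?area"),
     ("?o", "sosa:hasResult", '"1"')],
    [("?area", "rdf:type", ":OffLimitArea")],
)
TRESPASS: Rule = (
    [("?o", "sosa:observedProperty", ":WorkerTag"),
     ("?o", "sosa:hasFeatureOfInterest", "?area"),
     ("?o", "sosa:hasResult", "?w"),
     ("?area", "rdf:type", ":OffLimitArea")],
    [("?area", "rdf:type", ":TrespassedArea")],
)
RULES = [TRESPASS, OFF_LIMIT]
```

Calling `closure(GRAPH, RULES)` needs two rounds here: the off-limit rule fires first, and only then does the trespass rule find a match, which is the chaining behaviour the closure definition is meant to capture.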

3 Problem Description

To reason about schemas we need a simple language to model, and at the same time restrict, the type of triples that an RDF graph can contain. It should be noted that, despite the similarity in the name, the existing RDF schema (RDFS) vocabulary is used to describe ontological properties of the data, but is not designed to restrict the type of triples allowed in the dataset. In this paper we define a triplestore schema (or just schema) $S$ as a pair $\langle S^G, S^\Delta \rangle$, where $S^G$ is a set of triple patterns, and $S^\Delta$ is a subset of the variables in $S^G$ which we call the no-literal set. Intuitively, $S^G$ defines the type of triples allowed in a database, where variables act as wildcards, which can be instantiated with any constant element.

To account for the restrictions imposed by the RDF data model, the no-literal set $S^\Delta$ defines which variables cannot be instantiated with literals, thus $S^\Delta$ must at least include all variables that occur in the subject or predicate position in $S^G$. For example, if ${S'}^G = \{\langle ?v1, :a, ?v2 \rangle\}$ and ${S'}^\Delta = \{?v1\}$, then the instances of schema $S'$ can contain any triple that has :a as a predicate. If ${S''}^G = \{\langle :b, :c, ?v3 \rangle\}$ and ${S''}^\Delta = \{?v3\}$, the instances of $S''$ can contain any triple that has :b as a subject, :c as a predicate, and a URI as an object. To prevent the occurrence of complex interdependencies between variables, we restrict each variable to occur only once (both across triples, and within each triple) in $S^G$ and in the rule consequents.

A graph $I$ is an instance of a schema $S$ if for every triple $t^I$ in $I$ there exists a triple pattern $t^S$ in $S^G$, and a mapping $m$ such that (1) $m(t^S) = t^I$ and (2) $m$ does not bind any variable in $S^\Delta$ to a literal. In this case we say that $S$ models graph $I$ (and that each triple $t^S$ models triple $t^I$). All instances of $S$ are denoted by $\mathbb{I}(S)$. We say that two schemas $S$ and $S'$ are semantically equivalent if they model the same set of instances (formally, if $\mathbb{I}(S) = \mathbb{I}(S')$). Notice that any subset of an instance of a schema is still an instance of that schema. A rule $r$ within a set of rules $R$ is applicable with respect to a triplestore schema $S$ if there exists a graph instance $I$ of $S$, such that the precondition of $r$ matches $\mathit{clos}(I, R \setminus \{r\})$, the closure of $I$ under the remaining rules.
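The instance condition can be checked mechanically. The following Python sketch assumes triples are 3-tuples of strings, that variables start with `?` and literals with a double quote; the helper names `models` and `is_instance` and the sample URIs are illustrative and not taken from the paper's implementation.

```python
from typing import Dict, Iterable, Set, Tuple

Triple = Tuple[str, str, str]


def is_var(term: str) -> bool:
    return term.startswith("?")


def is_literal(term: str) -> bool:
    return term.startswith('"')


def models(pattern: Triple, triple: Triple, no_literal: Set[str]) -> bool:
    """True if some mapping m gives m(pattern) == triple without binding
    a no-literal variable to a literal."""
    binding: Dict[str, str] = {}
    for p, g in zip(pattern, triple):
        if is_var(p):
            if p in no_literal and is_literal(g):
                return False
            if binding.setdefault(p, g) != g:
                return False
        elif p != g:
            return False
    return True


def is_instance(graph: Iterable[Triple],
                schema_graph: Iterable[Triple],
                no_literal: Set[str]) -> bool:
    """A graph is an instance if every triple is modelled by some pattern."""
    patterns = list(schema_graph)
    return all(any(models(tp, t, no_literal) for tp in patterns) for t in graph)


# Any triple with predicate :a is allowed; the object may be a literal.
schema_a = [("?v1", ":a", "?v2")]
# Subject :b, predicate :c, and the object must be a URI.
schema_b = [(":b", ":c", "?v3")]
```

For instance, `is_instance({(":x", ":a", '"5"')}, schema_a, {"?v1"})` holds, while `is_instance({(":b", ":c", '"5"')}, schema_b, {"?v3"})` does not, because `?v3` is in the no-literal set. The empty graph is an instance of every schema, in line with the observation that subsets of instances are instances.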

Consider the following example scenario of a mine where sensors produce data modelled according to the Semantic Sensor Network Ontology (SSN) [19], with namespace sosa. In SSN, sensor measurements are called observations. A simple approach to create the schema of a dataset is the following. A dataset that can be populated with sensor measurements of a property (e.g., temperature) can be defined with triple pattern . Pattern indicates that the results of these measurements are collected and pattern indicates that the results are applicable to a specific entity (e.g., a room). Similar patterns are presented in [7] in the context of converting CSV data into the SSN format. In this example, the sensors are deployed only in one tunnel, namely , and schema is:

 = ,
      ,
      ,
      
 = 

We now consider instance of schema . In this instance, the sensors in tunnel A observed both a dangerous gas concentration, represented by the value “1”, and the presence of worker .

 = ,
      ,
      ,
      ,
      ,
      

Consider two rules and . The first one detects when workers trespass on an “off-limit” area, and the second one labels areas with dangerous gas concentrations as “off-limit”.

 = ,
      ,
      ,
      
       
 = ,
      ,
      
       

Since the precondition of rule matches dataset , we can apply the rule and derive a new fact: . On the instance extended by this new fact, rule is applicable and adds .

Our approach relies on being able to decide which rules are applicable on a specific triplestore schema, e.g., , in absence of any particular instance, e.g., . Since the precondition of rule matches dataset , this rule is directly applicable on schema , and we would like to be able to decide this by only looking at . Moreover if we can decide this and extend schema with a triple pattern that is the schema of (in this case that schema would be the same triple itself), then we would be able to reason with this new schema and decide that rule is also applicable. In practice, what we would like to do is to compute a schema that captures all consequences of applying our set of rules on any potential instance.

The following definition captures this intuition. Given a schema $S$ and a set of rules $R$, a schema $S'$ is a schema consequence of $S$ with respect to $R$, denoted $\mathit{con}(S, R)$, if $\mathbb{I}(S') = \bigcup_{I \in \mathbb{I}(S)} \{ I' \mid I' \subseteq \mathit{clos}(I, R) \}$. We can notice that since every subset of an instance of a schema is still an instance of that schema, a dataset can contain the consequence of a rule application without containing a set of triples matching the antecedent. This situation is commonly encountered when some triples are deleted after an inference is made.

Keeping track of the schema consequences allows us to directly see which rules are applicable to instances of a schema without running the rules on the data. In correspondence to a single rule application $r(I)$, of a rule $r$ on an instance $I$, we define a basic consequence of a schema $S$ by a rule $r$, denoted by $r(S)$, as a finite schema $S'$ for which $\mathbb{I}(S') = \bigcup_{I \in \mathbb{I}(S)} \{ I' \mid I' \subseteq r(I) \}$. It is now easy to see that the consequence schema for a set of rules $R$ is obtained by repeatedly executing $r(S)$ for all $r \in R$ until no new pattern is inferred. Formally, $\mathit{con}(S, R) = \bigcup_{i=0}^{n} S_i$, with $S_0 = S$, and $S_{i+1} = \bigcup_{r \in R} r(S_i)$, and $S_n = S_{n-1}$ (modulo variable names). In the following section we focus on the core of our problem which is computing a single basic schema consequence $r(S)$, and describe two approaches for this, namely Schema Consequence by Critical Instance ($\mathit{critical}(S, r)$), and Schema Consequence by Query Rewriting ($\mathit{score}(S, r)$).

4 Computing the Basic Schema Consequence

Given a schema $S$ and a rule $r: A \to C$, our approach to compute the basic schema consequence for $r$ on $S$ is based on evaluating the antecedent $A$, or an appropriate rewriting thereof, on a “canonical” instance of $S$, representative of all instances modelled by the schema. The mappings generated by this evaluation are then (1) filtered (in order to respect certain literal restrictions in RDF) and (2) applied appropriately to the consequent $C$ to compute the basic schema consequence.

We present two approaches that use two different canonical instances. The first instance is based on the concept of a critical instance, which has been investigated in the area of relational databases before [20] (and similar notions in the area of Description Logics [11]). Adapted to our RDF setting, the critical instance would be created by substituting the variables in our schema, in all possible ways, with constants chosen from the constants in the schema $S$ and the rule $r$, as well as a new fresh constant that occurs in neither. In [20] this instance is used in order to decide Chase termination; the Chase refers to rule inference with existential rules, which are more expressive than the ones considered here and for which the inference might be infinite (see [4] for an overview of the Chase algorithm). Although deciding termination of rule inference is slightly different to computing the schema consequence, we show how we can actually take advantage of the critical instance in order to solve our problem. Nevertheless, this approach, which we call critical, creates prohibitively large instances when compared to the input schema. Thus, later on in this section we present a rewriting-based approach, called score, that runs a rewriting of the rule on a much smaller canonical instance of the same size as the schema graph.

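The Critical Approach. For both versions of our algorithms we will use a new fresh URI $:\lambda$ such that $:\lambda \notin \mathit{const}(S^G) \cup \mathit{const}(r)$. Formally, the critical instance $\mathbb{C}(S, r)$ is the set of triples:

$$\mathbb{C}(S, r) = \left\{ t \;\middle|\; t^S \in S^G \text{ and, for each } i \in \tau,\; t[i] = \begin{cases} t^S[i] & \text{if } t^S[i] \notin \mathbb{V} \\ c & \text{if } t^S[i] \in \mathbb{V}, \text{ where } c \in \mathbb{U}, \text{ or } i = 3 \text{ and } t^S[i] \notin S^\Delta \end{cases},\; c \in \mathit{const}(S^G) \cup \mathit{const}(r) \cup \{:\lambda\} \right\}$$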

The critical instance replaces variables with URIs and literals from the set $\mathit{const}(S^G) \cup \mathit{const}(r) \cup \{:\lambda\}$, while making sure that the result is a valid RDF graph (i.e. literals appear only in the object position) and that it is an instance of the original schema (i.e. not substituting a variable in $S^\Delta$ with a literal). In order to compute the triples of our basic schema consequence for rule $r$ we evaluate $A$ on the critical instance, and post-process the mappings as we will explain later. Before presenting this post-processing of the mappings we stress the fact that this approach is inefficient and, as our experiments show, not scalable. For each triple $t^S$ in the input schema $S$, up to $|\mathit{const}(S^G) \cup \mathit{const}(r) \cup \{:\lambda\}|^{|\mathit{vars}(t^S)|}$ new triples might be added to the critical instance.
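To make the size blowup concrete, the construction described above can be sketched in Python. This is an illustration under stated assumptions rather than the authors' code: triples are 3-tuples of strings, variables start with `?`, literals start with a double quote, each variable occurs only once in the schema (as the paper requires), and the `fresh` argument is a hypothetical stand-in for the fresh URI.

```python
from itertools import product
from typing import Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]


def is_var(term: str) -> bool:
    return term.startswith("?")


def is_literal(term: str) -> bool:
    return term.startswith('"')


def critical_instance(schema_graph: Iterable[Triple],
                      no_literal: Set[str],
                      constants: Iterable[str],
                      fresh: str = ":fresh") -> Set[Triple]:
    """Substitute every schema variable, in all possible ways, with a
    constant from the pool, keeping only valid RDF triples that respect
    the no-literal set."""
    pool = sorted(set(constants) | {fresh})
    instance: Set[Triple] = set()
    for pattern in schema_graph:
        choices: List[List[str]] = []
        for position, term in enumerate(pattern):
            if not is_var(term):
                choices.append([term])
                continue
            # Literals are only allowed in the object position (index 2),
            # and only for variables outside the no-literal set.
            literal_ok = position == 2 and term not in no_literal
            choices.append([c for c in pool if literal_ok or not is_literal(c)])
        instance.update(product(*choices))
    return instance


small = critical_instance([("?v1", ":a", "?v2")], {"?v1"}, {":a", '"1"'})
wide = critical_instance([("?s", "?p", "?o")], {"?s", "?p"},
                         {":a", ":b", '"1"'})
```

With two constants plus the fresh URI, the single pattern in `small` already yields 6 triples, and the fully variable pattern in `wide` yields 36 with three constants, which illustrates why the number of triples grows exponentially in the number of variables per schema triple.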

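The Score Approach. To tackle the problem above we present a novel alternative solution based on query rewriting, called score. This alternative solution uses a small instance called the sandbox instance which is obtained by taking all triple patterns of our schema graph $S^G$ and substituting all variables with the same fresh URI $:\lambda$. This results in an instance with the same number of triples as $S^G$. Formally, a sandbox graph $\mathbb{S}(S)$ is the set of triples:

$$\mathbb{S}(S) = \left\{ t \;\middle|\; t^S \in S^G \text{ and, for each } i \in \tau,\; t[i] = \begin{cases} :\lambda & \text{if } t^S[i] \in \mathbb{V} \\ t^S[i] & \text{otherwise} \end{cases} \right\}$$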

Contrary to the construction of the critical instance, in our sandbox graph, variables are never substituted with literals (we will deal with RDF literal peculiarities in a post-processing step). Also notice that $\mathbb{S}(S) \in \mathbb{I}(S)$ and $\mathbb{S}(S) \subseteq \mathbb{C}(S, r)$. As an example, consider the sandbox graph of the example schema from Section 3:

) = ,
      ,
      ,
      

The critical instances and from our example would contain all the triples in , plus any other triple obtained by substituting some variables with constants other than , such as the triple: . A complete example of is available in our appendix.
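By contrast with the critical instance, the sandbox graph needs only one substitution per schema triple. A minimal Python sketch, assuming triples are 3-tuples of strings and variables start with `?`; the function name, the `fresh` parameter and the sample patterns are illustrative only:

```python
from typing import Iterable, Set, Tuple

Triple = Tuple[str, str, str]


def is_var(term: str) -> bool:
    return term.startswith("?")


def sandbox_graph(schema_graph: Iterable[Triple],
                  fresh: str = ":fresh") -> Set[Triple]:
    """Replace every variable of every schema triple with the same fresh URI."""
    return {
        tuple(fresh if is_var(term) else term for term in pattern)
        for pattern in schema_graph
    }


example = sandbox_graph([("?v1", ":a", "?v2"), ("?v3", ":b", ":c")])
```

Here `example` contains exactly one triple per schema pattern, `(":fresh", ":a", ":fresh")` and `(":fresh", ":b", ":c")`, so the canonical instance never grows beyond the size of the schema graph.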

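In order to account for all mappings produced when evaluating $A$ on $\mathbb{C}(S, r)$ we will need to evaluate a different query on our sandbox instance, essentially by appropriately rewriting $A$ into a new query. To compute mappings, we consider a rewriting $\mathbb{Q}(A)$ of $A$, which expands each triple pattern $t^A$ in $A$ into the union of the 8 triple patterns that can be generated by substituting any number of elements in $t^A$ with $:\lambda$. Formally, $\mathbb{Q}(A)$ is the conjunction of disjunctions of triple patterns:

$$\mathbb{Q}(A) = \bigwedge_{t \in A} \Big( \bigvee_{x_1 \in \{:\lambda,\, t[1]\},\; x_2 \in \{:\lambda,\, t[2]\},\; x_3 \in \{:\lambda,\, t[3]\}} \langle x_1, x_2, x_3 \rangle \Big)$$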

When translating this formula to SPARQL we want to select mappings that contain a binding for all the variables in the query, so we explicitly request all of them in the select clause. For example, consider graph pattern $\{\langle ?v3, :a, ?v4 \rangle, \langle ?v3, :b, :c \rangle\}$, which is interpreted as query:

SELECT ?v3 ?v4 WHERE { ?v3 :a ?v4 . ?v3 :b :c }

Query rewriting then corresponds to:

SELECT ?v3 ?v4 WHERE {
  { {?v3 :a ?v4} UNION {:λ :a ?v4} UNION {?v3 :λ ?v4}
    UNION {?v3 :a :λ} UNION {:λ :λ ?v4} UNION {:λ :a :λ}
    UNION {?v3 :λ :λ} UNION {:λ :λ :λ} }
  { {?v3 :b :c} UNION {:λ :b :c} UNION {?v3 :λ :c}
    UNION {?v3 :b :λ} UNION {:λ :λ :c} UNION {:λ :b :λ}
    UNION {?v3 :λ :λ} UNION {:λ :λ :λ} } }

Below we treat $\mathbb{Q}(A)$ as a union of conjunctive queries, or UCQ [1], and denote by $q \in \mathbb{Q}(A)$ a conjunctive query within it.
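The rewriting step is mechanical enough to sketch in Python. The snippet below is an illustration, not the paper's implementation: it assumes triple patterns are 3-tuples of strings with variables prefixed by `?`, and the `fresh` parameter is a hypothetical stand-in for the fresh URI. The helper names `rewrite_triple` and `to_sparql` are likewise assumptions.

```python
from itertools import product
from typing import List, Tuple

Triple = Tuple[str, str, str]


def is_var(term: str) -> bool:
    return term.startswith("?")


def rewrite_triple(pattern: Triple, fresh: str = ":fresh") -> List[Triple]:
    """All patterns obtained by replacing any subset of the three elements
    with the fresh URI (8 variants when the pattern does not contain it)."""
    variants: List[Triple] = []
    for combo in product(*[(term, fresh) for term in pattern]):
        if combo not in variants:
            variants.append(combo)
    return variants


def to_sparql(antecedent: List[Triple], fresh: str = ":fresh") -> str:
    """Render the rewriting as a SELECT query: one UNION group per triple
    pattern, with every variable requested explicitly."""
    variables = sorted({t for tp in antecedent for t in tp if is_var(t)})
    groups = []
    for pattern in antecedent:
        alternatives = " UNION ".join(
            "{" + " ".join(variant) + "}"
            for variant in rewrite_triple(pattern, fresh)
        )
        groups.append("  { " + alternatives + " }")
    return ("SELECT " + " ".join(variables) + " WHERE {\n"
            + "\n".join(groups) + "\n}")


query = to_sparql([("?v3", ":a", "?v4"), ("?v3", ":b", ":c")])
```

Each group is a disjunction over the variants of one antecedent triple, and placing the groups side by side inside the `WHERE` clause joins them, which corresponds to the conjunction of disjunctions in the definition of the rewriting.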

Having defined how the critical and score approaches compute a set of mappings, we now describe the details of the last two phases required to compute a basic schema consequence.

Filtering of the mappings. This phase deals with processing the mappings computed by either critical or score, namely $[\![A]\!]_{\mathbb{C}(S, r)}$ or $[\![\mathbb{Q}(A)]\!]_{\mathbb{S}(S)}$. It should be noted that it is not possible to simply apply the resulting mappings on the consequent of the rule, as such mappings might map a variable in the subject or predicate position to a literal, thus generating an invalid triple pattern. Moreover, it is necessary to determine which variables should be included in the no-literal set of the basic schema consequence. The schema $S'$, output of our approaches, is initialised with the same graph and no-literal set as $S$ (i.e. ${S'}^G = S^G$, ${S'}^\Delta = S^\Delta$). We then incrementally extend $S'$ on a mapping-by-mapping basis until all the mappings have been considered, at which point $S'$ represents the final output of our approach.

For each mapping $m$ in $[\![A]\!]_{\mathbb{C}(S, r)}$ or $[\![\mathbb{Q}(A)]\!]_{\mathbb{S}(S)}$, we do the following. We create a temporary no-literal set $\Delta^m$. This set will be used to keep track of which variables could not be bound to any literals if we evaluated our rule antecedent $A$ on the instances of $S$, or when instantiating the consequence of the rule. We initialise $\Delta^m$ with all the variables of our rule $A \to C$ that occur in the subject or predicate position in some triple of $A$ or $C$, as we know that they cannot be matched to or instantiated with literals.

Then, we consider the elements that occur in the object position in the triples $t^A$ of $A$. We take all the rewritings $t^q$ of $t^A$ in $\mathbb{Q}(A)$ (if using critical, it would be enough to consider a single rewriting $t^q$ with $t^q = t^A$). Since the mapping $m$ has been computed over the canonical instance ($\mathbb{S}(S)$ or $\mathbb{C}(S, r)$ depending on the approach), we know that there exists at least one $t^q$ such that $m(t^q)$ belongs to the canonical instance. We compute the set of schema triples $t^S$ that model $m(t^q)$, for any of the above $t^q$. Intuitively, these are the schema triples that enable $t^A$, or one of its rewritings, to match the canonical instance with mapping $m$. If $t^A[3]$ is a literal $l$, or a variable mapped to a literal $l$ by $m$, we check if there exists any $t^S$ from the above such that $t^S[3] = l$ or $t^S[3]$ is a variable that allows literals (not in $S^\Delta$). If such a triple pattern does not exist, then $m(A)$ cannot be an instance of $S$ since it has a literal in a non-allowed position, and therefore we filter out or disregard $m$. If $t^A[3]$ is a variable mapped to $:\lambda$ in $m$, we check whether there exists a $t^S$ such that $t^S[3]$ is a variable that allows literals (not in $S^\Delta$). If such a $t^S$ cannot be found, we add variable $t^A[3]$ to $\Delta^m$. Intuitively, this models the fact that $t^A[3]$ could not have been bound to literal elements under this mapping. Having considered all the triples $t^A \in A$, we filter out mapping $m$ if it binds any variable in $\Delta^m$ to a literal. If $m$ has not been filtered out, we say that rule $r$ is applicable, and we use $m$ to expand $S'$.

Figure 3: Average time to compute 20 schema consequences using score and critical as the schema size grows; the remaining generator parameters are held fixed. Due to the large difference in performance, subplots (a) and (b) focus, respectively, on critical and score.

Schema Expansion. For each mapping $m$ that is not filtered out, we compute the substitution $s^m$, which contains all the bindings in $m$ that map a variable to a value other than $:\lambda$, and for every binding $?v \to\; :\lambda$ in $m$, a variable substitution $?v \to ?v^*$ where $?v^*$ is a fresh new variable. We then add triple patterns $s^m(C)$ to ${S'}^G$ and then add the variables $s^m(\Delta^m) \cap \mathit{vars}({S'}^G)$ to ${S'}^\Delta$.
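The expansion step for a single surviving mapping can be sketched as follows. This Python snippet is one reading of the description above under explicit assumptions: triple patterns are 3-tuples of strings, variables start with `?`, the `fresh` parameter stands in for the fresh URI, `delta_m` is the temporary no-literal set computed during filtering, and fresh variables are named `?n0`, `?n1`, and so on. None of these names come from the paper's implementation.

```python
from itertools import count
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]


def is_var(term: str) -> bool:
    return term.startswith("?")


def expand_schema(schema_graph: List[Triple],
                  no_literal: Set[str],
                  consequent: List[Triple],
                  mapping: Dict[str, str],
                  delta_m: Set[str],
                  fresh: str = ":fresh") -> Tuple[List[Triple], Set[str]]:
    """Extend a schema with the consequent instantiated by one mapping."""
    used = {t for tp in schema_graph for t in tp if is_var(t)}
    ids = count()

    def fresh_variable() -> str:
        # Pick a variable name that does not yet occur in the schema.
        while True:
            name = f"?n{next(ids)}"
            if name not in used:
                used.add(name)
                return name

    # Bindings to real constants are kept; bindings to the fresh URI
    # become fresh variables, i.e. "any value could appear here".
    substitution = {
        var: (fresh_variable() if value == fresh else value)
        for var, value in mapping.items()
    }
    added = [tuple(substitution.get(t, t) for t in tp) for tp in consequent]
    new_graph = list(schema_graph) + [tp for tp in added
                                      if tp not in schema_graph]
    graph_vars = {t for tp in new_graph for t in tp if is_var(t)}
    renamed = {substitution.get(v, v) for v in delta_m}
    return new_graph, set(no_literal) | (renamed & graph_vars)


base = [("?v1", ":p", "?v2")]
rule_consequent = [("?a", "rdf:type", ":OffLimitArea")]
ground = expand_schema(base, {"?v1"}, rule_consequent,
                       {"?o": ":fresh", "?a": ":TunnelA"}, {"?o", "?a"})
open_ended = expand_schema(base, {"?v1"}, rule_consequent,
                           {"?a": ":fresh"}, {"?a"})
```

In `ground` the consequent variable was bound to a real constant, so a ground triple pattern is added and the no-literal set is unchanged. In `open_ended` it was bound to the fresh URI, so the new pattern carries a fresh variable, which also enters the no-literal set because it sits in the subject position.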

Although the schema consequences produced by $\mathit{score}(S, r)$ and $\mathit{critical}(S, r)$ might not be identical, they are semantically equivalent. This notion of equivalence is captured by the following theorem.

Theorem 1

For all rules $r: A \to C$ and triplestore schemas $S$, $\mathbb{I}(\mathit{score}(S, r)) = \mathbb{I}(\mathit{critical}(S, r))$.

The score approach (and by extension also critical, by Theorem 1) is sound and complete. The following theorem captures this notion by stating the semantic equivalence of $\mathit{score}(S, r)$ and the basic schema consequence $r(S)$.

Theorem 2

For all rules $r: A \to C$ and triplestore schemas $S$, $\mathbb{I}(\mathit{score}(S, r)) = \mathbb{I}(r(S))$.

For our proofs, we refer the reader to our appendix.

Termination. It is easy to see that our approaches terminate since our rules do not contain existential variables, and do not generate new URIs or literals (but just fresh variable names). After a finite number of iterations, either approach will only generate isomorphic (and thus equivalent) triple patterns.

Complexity. Our central problem in this paper, computing the schema consequence for a set of rules, can be seen as a form of Datalog evaluation [1] on our critical or sandbox instance. Datalog has been extensively studied in databases and its corresponding evaluation decision problem (whether a given tuple is in the answer of a Datalog program) is known to be EXPTIME-complete in the so-called combined complexity [25], and PTIME-complete in data complexity [25, 9]. Data complexity in databases refers to the setting in which the query (or Datalog program) is considered fixed and only the data is considered an input to the problem. In our setting, data complexity refers to the expectation that the overall number and complexity of the rules remains small. It is not hard to see that the corresponding decision problem, stated below, remains PTIME-complete in data complexity. The intuition is that once we construct the critical instance in polynomial time (or alternatively, we grow our set of rules by a polynomial rule rewriting) we have essentially an equivalent problem to Datalog evaluation (our rules being the Datalog program, and the canonical instance being the database).

Theorem 3

Given triple pattern $t$ and schema $S$ as inputs, the problem of deciding whether $t$ is in the consequence schema of $S$ for a fixed set of rules $R$ is PTIME-complete.

5 Experimental Evaluation

Figure 4: Average time to compute 20 schema consequences using score as the number of rules grows, for three configurations of the antecedent size; the remaining generator parameters are held fixed.

We developed a Java implementation of the score and critical approaches and evaluated them on synthetic datasets to compare their scalability. We developed a synthetic schema and rule generator that is configurable with 7 parameters, which we now describe. To reflect the fact that triple predicates are typically defined in vocabularies, our generator does not consider variables in the predicate position. Random triple patterns are created as follows. Predicate URIs are randomly selected from a set of predicate URIs. Elements in the subject and object position are instantiated as constants with a given probability, or else as new variables. Constants in the subject position are instantiated with a random URI, and constants in the object position with a random URI with a given probability, or otherwise with a random literal. Random URIs and literals are selected, respectively, from a set of URIs and a set of literals. We consider chain rules where the triples in the antecedent join each other to form a list where the object of a triple is the same as the subject of the next. The consequent of each rule is a triple having the subject of the first triple in the antecedent as a subject, and the object of the last triple as object. For instance, a chain rule with two antecedent triples has the form $\{\langle ?v_0, :p_1, ?v_1 \rangle, \langle ?v_1, :p_2, ?v_2 \rangle\} \to \{\langle ?v_0, :p_3, ?v_2 \rangle\}$.
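The chain-rule shape described above is easy to reproduce. The following Python sketch is a simplified, hypothetical generator: it places a variable in every subject and object position, whereas the generator described in the text may also draw constants, and the function name, parameters and predicate names are assumptions for illustration.

```python
import random
from typing import List, Tuple

Triple = Tuple[str, str, str]
Rule = Tuple[List[Triple], List[Triple]]


def chain_rule(antecedent_size: int,
               predicates: List[str],
               rng: random.Random) -> Rule:
    """Build a rule whose antecedent triples form a chain: the object of
    each triple is the subject of the next one."""
    variables = [f"?v{i}" for i in range(antecedent_size + 1)]
    antecedent = [
        (variables[i], rng.choice(predicates), variables[i + 1])
        for i in range(antecedent_size)
    ]
    # The consequent links the start of the chain to its end.
    consequent = [(variables[0], rng.choice(predicates), variables[-1])]
    return antecedent, consequent


antecedent, consequent = chain_rule(3, [":p1", ":p2", ":p3"],
                                    random.Random(0))
```

Passing an explicit `random.Random` instance keeps the generated rule sets reproducible across runs, which matters when averaging timings over repeated experiments.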

In each run of the experiment we populate a schema and a set of rules, each rule having the same number of triples in its antecedent. To ensure that some rules in each set are applicable, half of the schema is initialized with the antecedent triples of randomly selected rules. The other half is populated with random triple patterns. We initialize the no-literal set with all the variables in the subject and predicate position in the triples of the schema. The code used in these experiments is available on GitHub (https://github.com/paolo7/ap2); it uses Apache Jena (https://jena.apache.org/) to handle query execution. We ran the experiments on a standard Java virtual machine running on an Ubuntu 16.04 computer with 15.5 GB RAM and an Intel Core i7-6700 processor. Average completion times of over 10 minutes have not been recorded.

Figure 3 shows the time to compute the schema consequence for different schema sizes using critical and score. The parameters have been chosen to be small enough to accommodate the high computational complexity of the critical approach. This figure shows that score is orders of magnitude faster, especially on large schema sizes. The critical approach, instead, times out for schema sizes of over 33 triples.

Figure 4 shows the time to compute the schema consequence for different antecedent sizes and numbers of rules. The critical approach is not present in this figure, as it timed out in all the configurations. As this figure shows, the score approach can easily scale to a large set of rules. Given the complexity of SPARQL query answering [22], we can also notice an exponential increase in computation time as more triples are added to the antecedent of a rule. In our experiment setup, the score approach scales to rules with antecedent sizes of up to 12 triples, before timing out.

6 Related Work

To the best of the authors’ knowledge, our approach is the first to determine the applicability of inference rules to types of RDF triplestores specified by their schema, and to expand their schema with the potential consequences of such rules. Unlike related work on provenance paths for query inferences [14], we do not explicitly model the dependencies between different rules. Instead, we compute their combined potential set of inferences by expanding the original schema on a rule-by-rule basis, through multiple iterations, following the basic principles of the chase algorithm. We choose to follow a subset of the RDF data model, and not a simpler graph model such as generalised RDF [8], to make our approach more applicable in practice, and compatible with existing tools. We pay particular attention to literals, as they are likely to feature prominently in IoT sensor measurements.

A possible application of our approach is to facilitate the sharing and reusing of inference rules. A relevant initiative in the IoT domain is Sensor-based Linked Open Rules (S-LOR) [12], which provides a comprehensive set of tools to deal with rules, including a rule discovery mechanism. By classifying rules according to sensor types, domain experts can discover and choose which inference rules are most relevant in a given scenario. Our approach could automate parts of this process, by selecting rules applicable to the available data sources. We refer to [24] for a comprehensive review of rule-based reasoning systems applicable to IoT.

Our approach to defining a triplestore schema is related to a number of similar languages, and in particular to Shape Expressions (ShEx) [23] and the Shapes Constraint Language (SHACL) [18]. The term shape, in this case, refers to a particular constraint on the structure of an RDF graph. ShEx and SHACL can be seen as very expressive schema languages, and computing schema consequences using such schemas would be impractical. In fact, each inference step would need to consider complex interdependencies between shapes and the values allowed in each triple element, and thus we would generate increasingly larger sets of constraints. The triplestore schema proposed in this paper is a simpler schema language and, if we disallow variables in the predicate position, can be modelled as a subset of both ShEx and SHACL.

7 Conclusion

As its main contribution, this paper presented two approaches to determine the applicability of a set of rules with respect to a database schema (i.e. whether a rule could ever match on any dataset modelled by the schema), by expanding such a schema to model the potential facts that can be inferred using those rules, which we call the schema consequence. This can be applied in IoT scenarios, where inference rules are used to aggregate sensor readings from diverse data sources in order to automate health and safety policies. As the underlying data sources evolve, it is important to determine whether rules are still applicable, and what they can infer.

We focused on RDF triplestores, and on inference rules that can be modelled as SPARQL queries, such as SPIN and SWRL. To do so, we defined a notion of a triplestore schema that constrains the type of triples allowed in a graph. This differs from the RDF schema (RDFS) specification, which is designed to define vocabularies. While we provide an example of how to describe the schema of simple sensor networks, we think extending this approach to more expressive schema languages could be an interesting avenue for future work.

The first of the two approaches that we presented is based on the existing notion of a critical instance; the second on query rewriting. We have theoretically demonstrated the functional equivalence of the approaches, as well as their soundness and completeness. Moreover, we have provided experimental evidence of the superior scalability of the second approach, which can be applied over large schemas and rulesets within seconds.

With this paper we intend to provide a first theoretical framework to reason about rule applicability. We hope that our approach will pave the way, on the one hand, for efficient and meaningful policy reasoning in IoT systems and, on the other, for new and interesting rewriting-based schema reasoning approaches in the knowledge representation and databases research areas.

References

  • [1] Abiteboul, S., Hull, R., Vianu, V.: Foundations of databases: the logical level. Addison-Wesley Longman Publishing Co., Inc. (1995)
  • [2] Bassiliades, N.: SWRL2SPIN: A tool for transforming SWRL rule bases in OWL ontologies to object-oriented SPIN rules. CoRR abs/1801.09061 (2018), http://arxiv.org/abs/1801.09061
  • [3] Beimel, D., Peleg, M.: Editorial: Using OWL and SWRL to represent and reason with situation-based access control policies. Data Knowl. Eng. 70(6), 596–615 (2011)
  • [4] Benedikt, M., Konstantinidis, G., Mecca, G., Motik, B., Papotti, P., Santoro, D., Tsamoura, E.: Benchmarking the chase. In: Proceedings of the 36th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems. pp. 37–52. ACM (2017)
  • [5] Calvanese, D., Cogrel, B., Komla-Ebri, S., Kontchakov, R., Lanti, D., Rezk, M., Rodriguez-Muro, M., Xiao, G.: Ontop: Answering SPARQL queries over relational databases. Semantic Web 8(3), 471–487 (2017)
  • [6] Ceri, S., Gottlob, G., Tanca, L.: What you always wanted to know about datalog (and never dared to ask). IEEE Transactions on Knowledge and Data Engineering 1(1), 146–166 (1989)
  • [7] Chaochaisit, W., Sakamura, K., Koshizuka, N., Bessho, M.: CSV-X: A Linked Data Enabled Schema Language, Model, and Processing Engine for Non-Uniform CSV. 2016 IEEE Int. Conf. on Internet of Things (iThings) and IEEE Green Computing and Communications (GreenCom) and IEEE Cyber, Physical and Social Computing (CPSCom) and IEEE Smart Data (SmartData) pp. 795–804 (2016)
  • [8] Cyganiak, R., Wood, D., Lanthaler, M.: RDF 1.1 Concepts and Abstract Syntax. W3C Recommendation, W3C (2014), http://www.w3.org/TR/2014/REC-rdf11-concepts-20140225/
  • [9] Dantsin, E., Eiter, T., Gottlob, G., Voronkov, A.: Complexity and expressive power of logic programming. ACM Computing Surveys (CSUR) 33(3), 374–425 (2001)
  • [10] Fortineau, V., Fiorentini, X., Paviot, T., Louis-Sidney, L., Lamouri, S.: Expressing formal rules within ontology-based models using SWRL: an application to the nuclear industry. International Journal of Product Lifecycle Management 7(1), 75–93 (2014)
  • [11] Glimm, B., Kazakov, Y., Liebig, T., Tran, T.K., Vialard, V.: Abstraction refinement for ontology materialization. In: International Semantic Web Conference. pp. 180–195. Springer (2014)
  • [12] Gyrard, A., Serrano, M., Jares, J.B., Datta, S.K., Ali, M.I.: Sensor-based Linked Open Rules (S-LOR): An Automated Rule Discovery Approach for IoT Applications and Its Use in Smart Cities. In: 26th International Conference on World Wide Web Companion. pp. 1153–1159. WWW ’17 (2017)
  • [13] Harris, S., Seaborne, A.: SPARQL 1.1 Query Language. W3C Recommendation, W3C (2013), https://www.w3.org/TR/sparql11-query/
  • [14] Hecham, A., Bisquert, P., Croitoru, M.: On the Chase for All Provenance Paths with Existential Rules. In: Rules and Reasoning. pp. 135–150. Springer International Publishing (2017)
  • [15] Horrocks, I., Patel-Schneider, P.F., Boley, H., Tabet, S., Grosof, B., Dean, M., et al.: SWRL: A semantic web rule language combining OWL and RuleML. W3C Member Submission, W3C (2004), https://www.w3.org/Submission/SWRL/
  • [16] International Labour Organization: Act No. 6331 on Occupational Health and Safety (2012), https://www.ilo.org/dyn/natlex/natlex4.detail?p_lang=fr&p_isn=92011
  • [17] Knublauch, H.: SPIN - SPARQL Syntax. W3C Member Submission, W3C (2011), http://www.w3.org/Submission/spin-sparql/
  • [18] Knublauch, H., Kontokostas, D.: Shapes constraint language (SHACL). W3C Recommendation, W3C (2017), https://www.w3.org/TR/shacl/
  • [19] Lefrançois, M., Cox, S., Taylor, K., Haller, A., Janowicz, K., Phuoc, D.L.: Semantic Sensor Network Ontology. W3C Recommendation, W3C (2017), https://www.w3.org/TR/2017/REC-vocab-ssn-20171019/
  • [20] Marnette, B.: Generalized schema-mappings: from termination to tractability. In: Proc. of the twenty-eighth ACM SIGMOD-SIGACT-SIGART symp. on Principles of database systems. pp. 13–22. ACM (2009)
  • [21] Perera, C., Zaslavsky, A., Christen, P., Georgakopoulos, D.: Context Aware Computing for The Internet of Things: A Survey. IEEE Communications Surveys Tutorials 16(1), 414–454 (2014)
  • [22] Pérez, J., Arenas, M., Gutierrez, C.: Semantics and Complexity of SPARQL. ACM Transactions on Database Systems 34(3), 16:1–16:45 (2009)
  • [23] Prud’hommeaux, E., Labra Gayo, J.E., Solbrig, H.: Shape Expressions: An RDF Validation and Transformation Language. In: Proceedings of the 10th International Conference on Semantic Systems. pp. 32–40. SEM ’14, ACM (2014)
  • [24] Serrano, M., Gyrard, A.: A Review of Tools for IoT Semantics and Data Streaming Analytics. Building Blocks for IoT Analytics p. 139 (2016)
  • [25] Vardi, M.Y.: The Complexity of Relational Query Languages (Extended Abstract). In: Proceedings of the Fourteenth Annual ACM Symposium on Theory of Computing. pp. 137–146. STOC ’82, ACM, New York, NY, USA (1982)

Appendix A Proof of Theorem 1

Theorem 1

For all rules and triplestore schemas , .

Proof: Since the mappings generated by the critical and score approaches (namely and ) are post-processed in the same way, we demonstrate the equivalence of the two approaches by demonstrating that () the set of mappings generated by , which are not filtered out by our post-processing of the mappings, is a subset of the mappings generated by and that () every mapping is redundant with respect to another mapping in . We denote with the set of mappings that are not filtered out by our filtering function with respect to and . We say that a mapping is redundant with respect to a set of mappings (and a schema and rule ) if the schema consequences computed over and are semantically equivalent.

Note that both the critical and the sandbox instance are constructed by taking a triple in schema and substituting its variables. Thus we can associate each triple in the critical or the sandbox instance with the triples in that can be used to construct them, and we will call the latter the origin triples of the former. For each triple in the critical or the sandbox instance, at least one origin triple must exist.
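The notion of an origin triple can be illustrated with a short Python sketch. It assumes the same simplified encoding as elsewhere in these examples (triples as tuples of strings, variables prefixed with `?`), and the function name `origin_triples` is hypothetical. A schema triple pattern is treated as an origin of an instance triple if the instance triple can be obtained from it by substituting variables consistently.

```python
def origin_triples(triple, schema):
    """Return the schema triple patterns from which `triple` can be built
    by substituting variables (consistently within each pattern)."""
    origins = []
    for pattern in schema:
        binding = {}
        matches = True
        for pattern_term, triple_term in zip(pattern, triple):
            if pattern_term.startswith("?"):
                # A repeated variable must always receive the same value.
                if binding.setdefault(pattern_term, triple_term) != triple_term:
                    matches = False
                    break
            elif pattern_term != triple_term:
                # Constants of the pattern are retained unchanged.
                matches = False
                break
        if matches:
            origins.append(pattern)
    return origins


# Illustrative schema and two instance triples.
schema = [("?x", ":p", "?y"), (":a", ":p", "?z"), ("?x", ":q", ":b")]
origins_1 = origin_triples((":a", ":p", ":lambda"), schema)
origins_2 = origin_triples((":lambda", ":p", ":lambda"), schema)
```

Here the first triple has two origin triples, while the second has only one, since the constant `:a` in the second pattern cannot produce `:lambda` in the subject position.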

) Let a mapping . We get all triples of and for each one we will construct a triple of a conjunctive query , and mapping of into , such that either or makes redundant. For , if is a variable and then set , since there must be a triple in with in the position so if mapped on that triple, (for all triples of the critical instance that have in position , its origin triples have variables in the same position, so the sandbox instance would also have in position ). If is a variable and is a constant other than , then we distinguish two cases (a) and (b). Let be an origin triple of in the critical instance, then (a) if then would have retained in the corresponding position in the triple of which is the origin, and so we set to , in order for to also belong to mapping of into , or (b) if is a variable (we know that is the element in position of the triple in the sandbox graph of which is an origin) we consider two sub-cases () and ().

We set to in case (), namely if there is a position in a triple (different position to , i.e., if ), for which and for , an origin triple of , . Condition () essentially says that will have to be mapped to in the sandbox graph due to some other position of the rewriting, so can just be set to to simply match the corresponding triple in . In case () we set to ; this condition produces a more general mapping since while will be . We say that is more general than , denoted by , if and for all , either or .
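The "more general" relation between mappings can be sketched in Python as follows. The symbols of the definition are not legible in this version of the text, so the sketch rests on an assumed reading: two mappings are comparable only if they have the same domain, and the more general one either agrees with the other on each variable or maps that variable to the placeholder constant. The names `LAMBDA` and `is_more_general` are hypothetical, and mappings are modelled as dictionaries from variable names to constants.

```python
# Hypothetical placeholder constant standing in for "any other value".
LAMBDA = ":lambda"


def is_more_general(general, specific):
    """Check, under the assumed reading of the definition, whether `general`
    is more general than `specific`: same domain, and on every variable it
    either agrees with `specific` or maps the variable to LAMBDA."""
    if general.keys() != specific.keys():
        return False
    return all(general[var] in (specific[var], LAMBDA) for var in specific)


# Illustrative mappings over the variables ?x and ?y.
m = {"?x": ":a", "?y": ":b"}
m_star = {"?x": ":a", "?y": LAMBDA}
```

Under this reading, `m_star` is more general than `m` but not vice versa, and every mapping is more general than itself.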

Lastly we consider the case of being a constant. If there is an origin triple of in the critical instance such that is the same constant we also set it to . Otherwise, if is a variable in any origin triple we set to . This does not change the mapping under construction . By following the described process for all triples in we end up with a rewriting of and a mapping from to such that is more general than .

The existence of the more general mapping makes mapping redundant. This is true because if is not filtered out by our post-processing, then would also not be filtered out (since mappings can be filtered out only when a variable is mapped to a literal, but does not contain any such mapping not already in ). If is not filtered out, then this would lead to the schema being expanded with triples . The instances of the schema obtained through mapping are a subset of those obtained through mapping . This can be demonstrated by noticing that for every triple in there is a corresponding triple in that has at each position, either the same constant or , which is then substituted with a variable in the schema consequence. This variable can always be substituted with the same constant , when instantiating the schema, and therefore there cannot be an instance of the schema consequence generated from mapping that is not an instance of the schema consequence generated from mapping . In fact, the only case in which a variable in a schema cannot be instantiated with a constant is when is a literal and is in the no-literal set. However, for all mappings , we can notice that whenever our post-processing function adds a variable to the no-literal set, it would have rejected the mapping if that variable were mapped to a literal instead.

) Let a mapping . This means that there is a , for which . We get all triples of and for each one we will show that for the triple in , that was rewritten into , it holds that either is filtered out by the filtering function , or it is also a mapping from into the critical instance (i.e. ), hence . This would prove that the mappings generated by are either subsumed by the mappings in , or they would be filtered out anyway, and thus .

For all , for all , if is a variable then this position has not been changed from rewriting into , thus will exist in ; if is , then , and therefore in any origin triple of , had to be a variable and so will exist in position in the triples the critical instance will create for , and for these triples will partially map onto them. Similarly, if is a variable and is a constant , then this constant would be present in and so also in all triples coming from in the critical instance; again will partially map onto the triples in the critical instance generated from .

If is a constant other than , then the triple that got rewritten to triple would have the same constant in position . Also, this constant must be present in position in the triple of the sandbox graph, and therefore in position of any of its origin triples . In fact, by virtue of how the sandbox graph is constructed, any triple of the schema that does not have this constant in position , cannot be an origin triple of . Thus again, will partially map onto the triples in the critical instance of which is the origin triple.

Lastly, if is , then , and in any origin triple of the sandbox triple we have a variable in position , and in the critical instance we have triples, of which is the origin, with all possible URIs or in this position; this means that if is a URI, in the triple that got rewritten in , we can “match” it in the critical instance: if is a variable then will match the triple in the critical instance (instance of any origin triple of ) that contains in position (notice that should contain a value for since all variables are present in all rewritings of ); else, if is a URI there will be a corresponding image triple (again instance of any ) in the critical instance with the same constant in this position.

If, however, is a literal, it is possible that there is no triple in the critical instance that can match the literal in position . We will show that, in this case, the filter function will remove this mapping. If literal does not occur in position in any of the triples in the critical instance that have the same origin triple as then, by the definition of how the critical instance is constructed, this must be caused by the fact that either (a) or (b) for all the possible origin triples of , or is a constant other than . In both cases, our filtering function will filter out the mapping. In fact, it filters out mappings where variables are mapped to literals, if they occur in the subject or predicate position of a triple in . This is true in case (a). Moreover it filters out mappings if a variable in the object position there is a literal , or a variable mapped to a literal , such that there is no origin triple of such that or is a variable not in . This is true in case (b). Thus is a mapping from into the critical instance.

Appendix B Proof of Theorem 2

Theorem 2

For all rules and triplestore schemas , .

Theorem 2 states that is semantically equivalent to . We prove this by showing that () every triple that can be inferred by applying rule on an instance of is an instance of the schema and that () every triple in every instance of schema can be obtained by applying rule on an instance of . To reduce the complexity of the proof, we consider the case without literals (i.e. when no literal occurs in and , and all of the variables in the schema are in ). This proof can be extended to include literals by considering the post-processing of the mappings, as demonstrated in the proof of Theorem 1. More precisely, we reformulate Theorem 2 as follows:

Lemma 1

For all rules and for all triplestore schemas such that and that no literal occurs in and :

) for all triple patterns , for all triples in an instance of , there exists a triple pattern s.t.

) for all triple patterns , for all triples in an instance of there exists a triple pattern s.t. .

Proof: ) Given the premises of this part of Lemma 1, there must exist a graph on which an application of rule generates . For this to happen, there must exist a mapping , such that . Obviously, the set of triples on which matches into via is .

For all triples there exists a triple pattern that models (e.g.  ). We choose one such triple pattern (in case several exist) and we call it the modelling triple of . To prove the first part of Lemma 1 we will make use of the following proposition:

There exists a mapping and a rewriting , such that (thus ), and that and agree on all variables except for those that are mapped to in . Formally, and for every variable in , either , or .

By proving the proposition above we know that our algorithm would extend the original schema graph with graph pattern , and the set of variables with , where is a function that substitutes each in a graph with a fresh variable. If we take a triple pattern such that , then belongs to (because matches a rewriting of the antecedent of the rule on the sandbox graph). Graph also models triple . This is true because for each of the three positions of the schema triple pattern , either is a constant, and therefore this constant will also appear in position in both and or it is a variable, and by our proposition above either (a) and so , or (b) and therefore is a variable. We can then trivially recreate as an instance of by assigning to each variable in the value . Thus our algorithm would generate a triple pattern that models and that would complete the proof of the first part () of the Theorem.
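The substitution function mentioned above, which replaces the placeholder constant in a graph with fresh variables, can be sketched in Python as follows. This is an illustration under stated assumptions: each occurrence of the placeholder receives its own fresh variable, triples are tuples of strings, and the names `LAMBDA`, `lambdas_to_fresh_variables` and the `?fresh` prefix are hypothetical.

```python
from itertools import count

# Hypothetical placeholder constant standing in for "any other value".
LAMBDA = ":lambda"


def lambdas_to_fresh_variables(graph, prefix="?fresh"):
    """Replace each occurrence of LAMBDA in `graph` (a list of triples) with
    a distinct fresh variable. Returns the rewritten triples and the list of
    fresh variables that were introduced."""
    used = {term for triple in graph for term in triple}
    counter = count()
    fresh_vars = []
    result = []
    for triple in graph:
        new_triple = []
        for term in triple:
            if term == LAMBDA:
                name = f"{prefix}{next(counter)}"
                while name in used:  # skip names already present in the graph
                    name = f"{prefix}{next(counter)}"
                used.add(name)
                fresh_vars.append(name)
                term = name
            new_triple.append(term)
        result.append(tuple(new_triple))
    return result, fresh_vars


# Illustrative consequent graph obtained from a mapping into a sandbox graph.
graph = [(":lambda", ":hasResult", ":lambda"), (":s1", "rdf:type", ":lambda")]
patterns, fresh = lambdas_to_fresh_variables(graph)
```

The resulting triple patterns can then be instantiated with any constant in the positions that previously held the placeholder, which is the property the argument above relies on.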

The proof of our proposition is as follows. Consider every position in every triple , and their corresponding modelling triple . Let be the triple in such that . Let be , and thus . We are now going to construct a query , rewriting of , such that gives us the mapping in our proposition. And to do this, we have to construct a triple pattern which will be the rewriting of in . By definition of how our rewritings are constructed, every element is either or .

To set the value of each element in the triple pattern we consider the four possible cases that arise from and being either a constant or a variable. In parallel, we consider the element-by-element evaluation of on which generates .

  1. If and are both constants, then since , it must be true that . Moreover, since must be a model of , it follows that and therefore . We set element to be the constant , which matches the constant .

  2. If is a constant but is a variable, then we know that . We set element to be the constant , which matches the constant .

  3. If is a variable and is a constant , then we know that . Therefore mapping must contain binding . We set element to be the variable , so that when we complete the construction of , will contain mapping .

  4. If and are both variables, it must be that . If there exists a triple and a modelling triple of , and position such that is a variable and is a constant, then we set element to be the constant ( will be in our sandbox graph). Note that even though we don’t use variable in this part of our rewriting, the aforementioned existence of triple