Domain-specific knowledge graphs play an increasingly important role in deriving business insights for many enterprise applications, such as customer engagement, fraud detection, and network management (Noy:2019, ). One distinct characteristic of these enterprise knowledge graphs, compared to open-domain knowledge graphs like DBpedia (LehmannIJJKMHMK15, ), Freebase (Bollacker08freebase, ), and YAGO2 (Suchanek:2008, ), is their deep domain specialization. The domain specialization is typically captured by an ontology, which provides a semantic abstraction for describing the entities in the data and their relationships. The ontology often drives the creation of a knowledge graph by ingesting and transforming raw data from multiple sources into standard terminologies. The curated knowledge graphs allow users to express their queries in standard vocabularies, which promotes more interoperable and effective enterprise applications and services for specific domains (DBLP:series/synthesis/2015Dong, ; DBLP:series/synthesis/2015Christophides, ).
There are two popular approaches to store and query knowledge graphs: the RDF data model with the SPARQL query language (DBLP:journals/sigmod/SakrA09, ), or the property graph model with graph query languages such as Gremlin (DBLP:conf/sigmod/SunFSKHX15, ) and Cypher (DBLP:conf/sigmod/FrancisGGLLMPRS18, ). An important difference between RDF and property graphs is that RDF regularizes the graph representation as a set of triples, so even literals are represented as graph vertices. Such artificial vertices make it hard to express graph queries in a natural way. The property graph model instead uses vertices to represent entities and edges to represent the relationships between them, each described by key-value property pairs (DBLP:conf/grades/RestHKMC16, ). For this reason, property graph systems are rapidly gaining popularity for graph storage and retrieval. Examples include Neo4j (neo4j, ), Apache JanusGraph (janus, ), Azure Cosmos DB (cosmos, ), and Amazon Neptune (neptune, ), to name a few. Many techniques have been proposed for optimizing the query performance, system scalability, and transaction support of these systems (DBLP:conf/icde/NeumannM11, ; DBLP:conf/icde/MeimarisPMA17, ; DBLP:conf/edbt/TsialiamanisSFCB12, ; DBLP:conf/sigmod/BorneaDKSDUB13, ). However, the problem of property graph schema optimization, which is also critical to graph query performance, has been largely ignored.
In this paper, we tackle the property graph schema optimization problem for domain-specific knowledge graphs. Our goal is to create an optimized schema (we use the terms property graph schema, graph schema, and schema interchangeably) based on a given ontology, such that the corresponding property graph can efficiently support various types of graph queries (e.g., pattern matching, path finding, or aggregation queries) with better query performance. The raw data is loaded directly as a property graph that conforms to the optimized schema. Note that a property graph schema may not be logically equivalent to a given ontology; capturing the full expressivity of ontologies (e.g., negation, role inclusion, transitivity) in the form of a property graph schema is an unexplored and challenging problem, which is beyond the scope of this work. One straightforward way to create a property graph schema from an ontology is to map each ontology concept directly to a schema node, and each ontology relationship to a schema edge, analogous to ER-diagram-to-relational-schema mapping. However, we argue that graph query performance varies vastly across property graphs that hold the same data but conform to different schemas, and that the rich semantic information in the ontology provides unique opportunities for schema optimization. We illustrate this using two examples from the medical domain.
Example 1 (Pattern matching query). Consider the ontology in Figure 1(a): summary is a property of the DrugInteraction concept, which is connected to the DrugFoodInteraction and DrugLabInteraction concepts via inheritance (isA) relationships. Figures 1(b) and 1(c) show two alternative property graphs conforming to two different schemas. In Figure 1(b), the vertex di1 (i.e., an instance of DrugInteraction) leads to both dfi1 and dli1. In Figure 1(c), drug1 connects directly to the dfi1 and dli1 vertices. For any query that requires edge traversals from drug1 to dfi1, dli1, or both, property graph 2 clearly requires fewer edge traversals. A pattern matching query interested in Drug and the associated risk of DrugFoodInteraction achieves two orders of magnitude performance gains on the optimized property graph (23ms) compared to property graph 1 (3245ms).
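For concreteness, the pattern query in this example might be written in Cypher roughly as follows (the labels and relationship types are illustrative assumptions, not taken from the paper):

```cypher
// Property graph 1: two edge traversals through the DrugInteraction vertex
MATCH (d:Drug)-[:hasInteraction]->(:DrugInteraction)-[:isA]->(f:DrugFoodInteraction)
RETURN d.name, f.risk

// Optimized property graph 2: a single edge traversal
MATCH (d:Drug)-[:hasInteraction]->(f:DrugFoodInteraction)
RETURN d.name, f.risk
```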
Example 2 (Aggregation query). In Figure 1(a), the Drug concept is also connected to the Indication concept via a treat (1:M) relationship. In this case, we observe that if we replicate certain properties accessible via a 1:M relationship, edge traversals can be avoided. Figure 1(c) shows that the vertex drug1 has an additional property, a list of descriptions replicated from the desc property of ind1 and ind2. An aggregation query (COUNT) on the desc of Indication treated by Drug runs 8 times faster on this optimized property graph (78ms) than on property graph 1 (627ms). In this case, avoiding the edge traversals is extremely beneficial, especially when the number of edges between these two types of vertices is large.
These two examples show that edge traversal is one of the dominant factors affecting graph query performance, and having an optimized schema can greatly improve query performance. Moreover, the rich semantic relationships in an ontology provide a variety of opportunities to reduce graph traversals. To generate an optimized graph schema, we need to identify and exploit these opportunities in the ontology, and design different techniques to utilize them accordingly. As illustrated in the examples, certain optimization techniques require data replication resulting in space overheads. Hence, the schema optimization has to trade off between the query performance and the space consumption of the resulting property graph.
Our proposed approach. To the best of our knowledge, we are the first to address the problem of property graph schema optimization to improve graph query performance. In addition to the ontology, our approach also takes into account the space constraints, if any, and additional information such as data distribution and workload summaries (we refer to the access frequencies of concepts, relationships, and properties as workload summaries; they are formally defined later). We propose a set of rules designed to optimize graph query performance with respect to different types of relationships in the ontology. When there is a space constraint, we estimate the cost and benefit of applying these rules to each individual relationship by leveraging the additional data distribution and workload information. We propose two algorithms, concept-centric and relation-centric, which incorporate the cost-benefit scores to produce an optimized property graph schema. Our approach can seamlessly handle updates to the property graph, as long as its schema remains unchanged.
Contributions. The contributions of this paper can be summarized as follows:
1. We introduce an ontology-driven approach for property graph schema optimization.
2. We design a set of rules that reduce the edge traversals by exploiting the rich semantic relationships in the ontology, resulting in better graph query performance.
3. We propose concept-centric and relation-centric algorithms that harness the proposed rules to generate an optimized property graph schema from an ontology, under space constraints. The concept-centric algorithm utilizes the centrality analysis of concepts, and the relation-centric algorithm uses a cost-benefit model.
4. Our experiments show that our ontology-driven approach effectively produces optimized graph schemas for two real-world knowledge graphs from medical and financial domains. The queries over the optimized property graphs achieve up to 2 orders of magnitude performance gains compared to the graphs resulting from the baseline approach.
The rest of the paper is organized as follows. Section 2 introduces the basic concepts, formulates the problem, and provides an overview of our ontology-driven approach. Section 3 describes our optimization rules for different types of relationships in an ontology. Section 4 explains the algorithms to produce an optimized property graph schema. We provide our experimental results in Section 5, review related work in Section 6, and finally conclude in Section 7.
2. Preliminaries & Approach Overview
2.1. Preliminaries
An ontology describes a particular domain and provides a structured view of the data. Specifically, it provides an expressive data model for the concepts that are relevant to that domain, the properties associated with the concepts, and the relationships between concepts.
Definition 1 (Ontology (O)).
An ontology O = (C, P, R) contains a set of concepts C, a set of data properties P, and a set of relationships R between the concepts.
An ontology is typically described in OWL (owl, ), wherein a concept is defined as a class, a property associated with a concept is defined as a DataProperty, and a relationship between a pair of concepts is defined as an ObjectProperty. Each DataProperty p represents a characteristic of a concept, and P(c) represents the set of DataProperties associated with the concept c. Each ObjectProperty r is associated with a source concept src(r), also referred to as the domain of the ObjectProperty, a destination concept dst(r), also referred to as the range of the ObjectProperty, and a type type(r). The type can be either a functional (i.e., 1:1, 1:M, M:N), an inheritance (a.k.a. isA), or a union/membership relationship (even though inheritance and union are not ObjectProperties, we simplify the notation for presentation purposes). In this paper, we use the ontology as the semantic data model of a knowledge graph.
We adopt the widely used property graph model from (Robinson:2013, ).
Definition 2 (Property Graph (G)).
A property graph G = (V, E) is a directed multi-graph with vertex set V and edge set E, where each vertex and each edge has data properties consisting of multiple attribute-value pairs.
Similar to a relational database schema that describes tables, columns, and relationships of a relational database, the property graph schema is critical for creating high-quality domain-specific graphs. A property graph instantiated from a property graph schema provides agile and robust knowledge services with correctness, coverage, and freshness (Noy:2019, ).
A property graph schema can be specified in a data definition language such as Neo4j’s Cypher (DBLP:conf/sigmod/FrancisGGLLMPRS18, ), TigerGraph’s GSQL (DBLP:journals/corr/abs-1901-08248, ), or GraphQL SDL (Hartig:2019:DSP:3327964.3328495, ). They all define notions of node types and edge types, as well as property types that are associated with a node type or with an edge type. We adopt Cypher due to its popularity, but our proposed techniques are independent of the aforementioned languages.
Table 1 provides the notation used in this paper.
| c | a concept in an ontology |
| r | a relationship in an ontology |
| P(c) | all data properties associated to c |
| Rin(c) | all incoming relationships of c |
| Rout(c) | all outgoing relationships of c |
| src(r) | the source concept of r |
| dst(r) | the destination concept of r |
| type(r) | the relationship type of r (i.e., 1:1, 1:M, or M:N) |
| G | a property graph |
| V(c) | all instance vertices of c |
| v ∈ V(c) | an instance vertex of c |
| p | a property of v |
2.2. Approach Overview
Given an ontology, the problem of property graph schema optimization is to create a property graph schema such that the corresponding property graph can efficiently support various types of graph queries (e.g., pattern matching, path finding, or aggregation queries). In real knowledge graph applications, there is typically a space constraint on the graph size due to monetary cost (Kllapi:2011, ). Hence any practical solution has to incorporate the space constraint to produce an optimized property graph schema.
Figure 3 gives an overview of how the property graph schema is generated. The property graph schema generator takes as input an ontology and, optionally, data statistics, workload summaries, and a space constraint. It utilizes a set of rules designed for different types of relationships to produce an optimized property graph schema and a mapping from the ontology to the optimized schema. The schema describes the vertices, edges, and properties of a property graph to be instantiated in a graph database (e.g., Neo4j or JanusGraph). The mapping provides the provenance of the schema transformations introduced by the relationship rules (Section 3). At query time, a user issues graph queries against the original ontology using standard vocabularies. The query rewriter is responsible for rewriting the queries based on the mapping. Hence the stored property graph schema is transparent to the user, who can focus on expressing graph queries independent of the actual schema.
3. Relationship Rules
Graph queries often involve multi-hop traversal or vertex attribute lookup/analytics on property graphs. As shown in the motivating examples, edge traversals over a graph are vital to the overall query performance. Hence, we focus on the rich semantic relationships in an ontology and propose a set of novel rules for different types of relationships. These rules minimize edge traversals and consequently improve graph query performance.
Union Rule. In an ontology, a union relationship (r) connects a union concept (C_u) and a member concept (C_m). Each instance of a union concept is an instance of one of its member concepts, and each instance of a member concept is also an instance of the union concept. Figure 2 shows that BlackBoxWarning and ContraIndication are two member concepts of a union concept Risk. A graph query accessing an instance of Risk is equivalent to accessing the instances of either BlackBoxWarning, or ContraIndication, or both. In other words, if we create a property graph directly from the ontology shown in Figure 2, then queries starting from any vertex of either BlackBoxWarning or ContraIndication have to traverse through some vertex of Risk in order to reach the vertices of Drug. This leads to unnecessary edge traversals.
Hence we propose a union rule to connect the member concepts directly to all concepts that are connected to the union concept (Algorithm 1). Figures 4(a) and 4(b) show the property graph schema and the corresponding property graph after applying the union rule to the above example. In the optimized property graph, retrieving the drugs (e.g., Ibuprofen) that cause Asthma requires only a single edge traversal, instead of 2 in the property graph directly instantiated from the ontology.
For each union relationship, the mapping from the ontology to the optimized property graph schema is recorded as [R(C_u) \ U(C_u) COPY_INTO C_m], where C_u is the union concept, C_m is a member concept, and U(C_u) is the set of union relationships of C_u. This way the query rewriter can leverage this mapping to replace the original concept and relationship in a query with the corresponding ones in the optimized schema.
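A minimal Python sketch of the union rule follows. This is our own rendering (Algorithm 1 is not reproduced in this excerpt), and the schema encoding, a dict from concept name to a set of (relationship, neighbor) pairs, is an assumption:

```python
def apply_union_rule(edges, union, members):
    """Connect each member concept directly to every neighbor of the union
    concept (in both directions), then drop the union concept itself."""
    out_nbrs = edges.pop(union, set())
    for m in members:
        edges.setdefault(m, set()).update(out_nbrs)
    for c in edges:
        if any(n == union for _, n in edges[c]):
            # Rewire c's edges into the union to point at every member instead.
            rewired = {(rel, m) for rel, n in edges[c] if n == union for m in members}
            edges[c] = {(rel, n) for rel, n in edges[c] if n != union} | rewired
    return edges

# Risk is a union of BlackBoxWarning and ContraIndication (Figure 2);
# the relationship names here are illustrative.
schema = {
    "Risk": {("isRiskOf", "Drug")},
    "Drug": {("hasRisk", "Risk")},
    "BlackBoxWarning": set(),
    "ContraIndication": set(),
}
apply_union_rule(schema, "Risk", ["BlackBoxWarning", "ContraIndication"])
```

After the rule fires, queries touch BlackBoxWarning or ContraIndication directly, with no hop through a Risk vertex.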
Inheritance Rule. An inheritance relationship (r) contains a parent concept (C_p) and a child concept (C_c). Unlike a union concept, a parent concept in an inheritance relationship may have instances that are not present in any of its children concepts. This leads to the following possible scenarios.
Connect the child concept (C_c) directly to the concepts that are connected to its parent concept (C_p), and attach all data properties of C_p to C_c;
Connect the parent concept (C_p) directly to the concepts that are connected to its child concept (C_c), and attach all data properties of C_c to C_p;
Or leave C_p, C_c, and r unchanged.
In the first two cases, edge traversals can be avoided in the property graph conforming to the property graph schema. Figure 2 shows that DrugFoodInteraction and DrugLabInteraction are two children concepts of DrugInteraction. Applying the inheritance rule to these concepts can lead to two alternative optimized property graph schemas shown in Figure 5. Figures 5(a) and 5(b) demonstrate the first scenario where the data properties (summary) of the parent concept DrugInteraction are directly attached to two children concepts DrugFoodInteraction and DrugLabInteraction. Figures 5(c) and 5(d) depict the second scenario where the data properties risk and mechanism of two respective children concepts are now attached to the parent concept DrugInteraction.
However, attaching the data properties (P(C_p)) of the parent concept to the child concept incurs data replication, as P(C_p) is shared among all children concepts (Figure 5(b)). If the number of data properties shared by the children concepts is large, the data replication can introduce significant space overhead. On the other hand, when the data properties (P(C_c)) of the children concepts are replicated to their parent concept (C_p), C_p may end up with a large number of data properties (Figure 5(d)). However, these data properties may not exist in many instance vertices of C_p. Consequently, the instance vertices of C_p may consume unnecessary space. To remedy the above two issues, we propose to exploit the Jaccard similarity (Leskovec:2014, ) between P(C_p) and P(C_c) to decide the best strategy for the inheritance relationship:

J(C_p, C_c) = |P(C_p) ∩ P(C_c)| / |P(C_p) ∪ P(C_c)|    (1)
As described in Algorithm 2, if J(C_p, C_c) is greater than a threshold θ_h, the child concept C_c shares a lot of data properties with its parent concept C_p. Intuitively, this means that C_c has only a few properties in addition to the ones of C_p. In this case, moving P(C_c) from the child concept to C_p incurs less space overhead than the opposite direction. Similarly, if J(C_p, C_c) is less than a threshold θ_l (θ_l < θ_h), the child concept has little in common with its parent C_p. Intuitively, this means that C_c has many additional properties compared to C_p. Therefore, it is more cost-effective to make the data properties of the parent concept available at C_c. In either case, the inheritance rule avoids edge traversals in the resulting property graph.
Note that the similarity score of a parent concept and a child concept remains unchanged even if new data properties are added to one or both concepts as a result of applying other rules. The reason is that the Jaccard similarity is computed based on the given ontology, as it represents the semantic similarity between two concepts with an inheritance relationship. Hence we calculate the Jaccard similarity score for all inheritance relationships before applying any rules.
For an inheritance relationship between a parent concept C_p and a child concept C_c, the mapping from the ontology to the optimized property graph schema is recorded as either [P(C_c) COPY_INTO C_p], [R(C_c) COPY_INTO C_p] or [P(C_p) COPY_INTO C_c], [R(C_p) \ I(C_p) COPY_INTO C_c], where I(C_p) is the set of inheritance relationships of C_p.
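The decision procedure can be sketched as follows, a simplification of Algorithm 2 in which the threshold values and property sets are illustrative assumptions:

```python
def jaccard(a, b):
    """Jaccard similarity of two data-property sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def inheritance_strategy(parent_props, child_props, theta_high=0.7, theta_low=0.3):
    """Pick a strategy for an isA relationship based on the Jaccard score.
    The threshold defaults are illustrative, not from the paper."""
    j = jaccard(parent_props, child_props)
    if j > theta_high:
        # Child shares most properties with the parent: fold child into parent.
        return "copy child properties into parent"
    if j < theta_low:
        # Child is mostly distinct: replicate parent properties at the child.
        return "copy parent properties into child"
    return "leave unchanged"
```

For example, DrugInteraction's property set {summary} has no overlap with a child's {risk}, so the parent's properties would be copied into the child, matching Figures 5(a) and 5(b).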
One-to-one Rule. A 1:1 relationship (r = (C_s, C_d)) indicates that an instance of C_s can relate to only one instance of C_d and vice versa (e.g., Indication and Condition in Figure 2). A 1:1 relationship can simply be removed by merging C_s and C_d into a single concept C_sd. Any query accessing instance vertices of C_s and C_d can be satisfied by looking up the merged instance vertex of C_sd. In Figure 6(a), IndicationCondition is the merged concept with two data properties, name and note, attached. Hence the edge traversal (e.g., from Drug to Condition in Figure 2) is avoided, and the number of instance vertices (i.e., the space consumption) is reduced as well. Algorithm 3 shows the one-to-one rule, which is straightforward to follow.
For a 1:1 relationship, the mapping from the ontology to the optimized property graph schema is simply recorded as [C_s MERGE_INTO C_sd], [C_d MERGE_INTO C_sd], and [r REMOVED].
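At the schema level the one-to-one rule is a simple merge; a sketch under our own encoding (concept name to data-property set; the property names are illustrative):

```python
def apply_one_to_one_rule(schema, c_s, c_d, merged):
    """Merge the two endpoint concepts of a 1:1 relationship into a single
    concept carrying the union of their data properties."""
    schema[merged] = schema.pop(c_s) | schema.pop(c_d)
    return schema

# Indication and Condition from Figure 2 (property names are illustrative).
s = {"Indication": {"desc"}, "Condition": {"name", "note"}}
apply_one_to_one_rule(s, "Indication", "Condition", "IndicationCondition")
```

Instances are merged the same way, so a lookup on either original concept resolves to one vertex.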
One-to-many Rule. A 1:M relationship (r = (C_s, C_d)) indicates that an instance of C_s can potentially refer to several instances of C_d. In other words, in a 1:M relationship, an instance of C_s allows zero, one, or many corresponding instances of C_d. However, an instance of C_d cannot have more than one corresponding instance of C_s.
To better support aggregation (e.g., COUNT, SUM, AVG, etc.) and neighborhood (1-hop) lookup functions in graph queries, we propose to propagate each data property p in P(C_d) to C_s as a property of type LIST (Figure 7(a)). The aggregation and neighborhood lookup functions can directly leverage these localized list properties instead of traversing the edges of the 1:M relationship. As depicted in Figure 7(b), Indication.desc is a data property of drug2 consisting of a list of descriptions (i.e., [Fever, Headache]), which saves aggregation queries the edge traversals to the other instance vertices (e.g., ind1 and ind2). The potential savings can be substantial when there are many edges between the instance vertices of two concepts, such as Drug and Indication.
However, the newly introduced LIST property incurs additional space overhead, which can be expensive depending on the data distribution. Therefore, choosing the appropriate set of data properties to propagate from each 1:M relationship is critical with respect to both query performance and space consumption. We describe algorithms to choose these data properties in Section 4.2. Algorithm 4 corresponds to the one-to-many rule.
For a 1:M relationship, the mapping from the ontology to the optimized property graph schema is recorded as [p COPY_INTO C_s AS LIST].
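On the instance level, the propagation amounts to collecting the destination values per source vertex. A sketch using our own in-memory encoding (vertex dicts and an edge list; the example mirrors Figure 7, with the drug's own properties made up for illustration):

```python
def propagate_list_property(src_vertices, edges, dst_vertices, dst_label, prop):
    """Replicate property `prop` of the destination vertices onto each source
    vertex as a list-valued property named '<DstLabel>.<prop>'."""
    key = f"{dst_label}.{prop}"
    for sid, v in src_vertices.items():
        v[key] = [dst_vertices[did][prop] for s, did in edges if s == sid]

# Drug treats Indication (Figure 7); edges are (source_id, destination_id) pairs.
drugs = {"drug2": {"name": "drug2"}}
inds = {"ind1": {"desc": "Fever"}, "ind2": {"desc": "Headache"}}
treats = [("drug2", "ind1"), ("drug2", "ind2")]
propagate_list_property(drugs, treats, inds, "Indication", "desc")
# drugs["drug2"]["Indication.desc"] == ["Fever", "Headache"]
```

A COUNT over Indication.desc now reads one vertex instead of traversing every treat edge.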
Many-to-many Rule. An M:N relationship (r = (C_s, C_d)) indicates that an instance of C_s can have several corresponding instances of C_d, and vice versa. An M:N relationship is essentially equivalent to two 1:M relationships, namely (C_s, C_d) and (C_d, C_s). Therefore, the many-to-many rule is identical to the one-to-many rule, except that the property propagation is done in both directions. Namely, a data property of C_d (C_s) is propagated as a property of type LIST to C_s (C_d), respectively. Hence applying the many-to-many rule leads to the same potential gains for queries with aggregation or neighborhood (1-hop) lookup functions, at the cost of additional space consumption.
In summary, all proposed rules reduce the number of edge traversals, which improves graph query performance. However, the union, inheritance, one-to-many, and many-to-many rules may incur space overheads. In Section 4, we describe our property graph schema optimization algorithms, which trade off performance gain against space overhead.
Query Rewriting using Mappings. As described earlier, a mapping record is generated when the corresponding relationship rule is triggered. The record provides the mapping provenance from the original ontology to the optimized property graph schema. When a query arrives, the query rewriter searches the mapping records for the involved vertices and relationships in order to rewrite the query. Due to space constraints, a full-blown description of query rewriting is beyond the scope of this paper. We illustrate query rewriting with two query examples expressed in Cypher.
Example 3. The following query, requesting the contra-indication of each drug, contains an inheritance (isA) relationship, which has been removed from the property graph after applying the inheritance rule.
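The query listing is absent from this excerpt; a plausible Cypher form of the original query, with illustrative labels and relationship types, is:

```cypher
MATCH (d:Drug)-[:hasRisk]->(r:Risk)<-[:isA]-(c:ContraIndication)
RETURN d.name, c.description
```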
After query rewriting, the query is expressed as:
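The rewritten listing is likewise absent; once the inheritance rule removes the isA hop, the query would take roughly this form (labels illustrative):

```cypher
MATCH (d:Drug)-[:hasRisk]->(c:ContraIndication)
RETURN d.name, c.description
```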
Example 4. The following query counts the number of effective dates of contracts that are managed by a corporation:
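A plausible Cypher form of this aggregation query (labels and property names are illustrative assumptions):

```cypher
MATCH (c:Corporation)-[:manages]->(t:Contract)
RETURN c.name, COUNT(t.effectiveDate)
```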
After applying the 1:M rule, the effective-date property of Contract has been replicated to Corporation. Thus, the query is rewritten as:
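Assuming the effective dates were replicated as a list-valued property by the 1:M rule, the rewritten query avoids the edge traversal entirely (illustrative form):

```cypher
MATCH (c:Corporation)
RETURN c.name, SIZE(c.`Contract.effectiveDate`)
```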
4. Property Graph Schema Optimization
In this section, we first introduce a property graph schema optimization algorithm in an ideal scenario (i.e., no space constraints). Then, we describe our concept-centric and relation-centric algorithms that harness the proposed rules and a cost-benefit model to generate an optimized property graph schema for a given space constraint.
4.1. Optimization Without Space Constraints
To produce an optimized property graph schema, we need to determine how to utilize the proposed rules described in Section 3. A straightforward approach is to iteratively apply these rules in order and generate the property graph schema.
Specifically, Algorithm 5 takes as input an ontology and first computes the Jaccard similarity scores for all inheritance relationships (Lines 1-2). Then, it iteratively applies the appropriate rule to each relationship in the ontology (Lines 3-16). At the end of each iteration, it checks if the ontology converges (Line 17). Finally when no more rule applies, a property graph schema is generated (Lines 18-19). In fact, these rules can be applied in any order, and the generated property graph schema is always the same.
Theorem 1.
Applying the union, inheritance, 1:M, and M:N rules in any order produces a unique property graph schema, if there is no space constraint.
The detailed proof can be found in Appendix A.
4.2. Schema Optimization With Space Constraints
While the naïve approach harnesses all potential optimization opportunities aggressively, it incurs space overheads from the union, inheritance, 1:M, and M:N rules. When the number of such relationships in the ontology is large, this can be expensive in terms of space consumption, especially in a cluster setting where many large-scale property graphs co-exist. Hence our goal is to produce an optimized property graph schema for a given space limit. The quality and the space consumption of an optimized property graph schema are measured by the total benefit and cost (i.e., space consumed) of applying the rules (given by Equations 3-5 in Section 4.2.2).
Definition 2 (Optimal Property Graph Schema).
Let S be the set of all property graph schemas such that for every PGS ∈ S we have Cost(PGS) ≤ B, where B is a given space budget. PGS* ∈ S is an optimal property graph schema if there is no PGS ∈ S such that Benefit(PGS) > Benefit(PGS*).
Finding an optimal property graph schema is exponential in the number of concepts and relationships in the ontology, which is practically infeasible. Hence, we need to design efficient heuristics to produce a near-optimal property graph schema. To achieve this goal, we propose two property graph schema optimization algorithms that leverage additional information such as data and workload characteristics.
Data characteristics contain the basic statistics about each concept, data property, and relationship specified in the given ontology. The statistics include the cardinality of data instances of each concept and relationship, as well as the data type of each data property. The data characteristics allow us to identify and prioritize the more beneficial relationships when applying union, inheritance, one-to-many and many-to-many rules, such that the space can be used more efficiently.
Access frequencies provide an abstraction of the workload in terms of how each concept, relationship, and data property is accessed by the queries in the workload. We use AF(C_s, r, p) to denote the frequency of queries (i.e., the number of queries) that access a data property p in P(C_d) from the concept C_s through the relationship r. A high frequency of a relationship indicates its relative importance among all relationships in the given ontology. Hence it is more imperative to apply the proposed rules to relationships with high frequency.
In case of no prior knowledge about the access frequency, we assume that it follows a uniform distribution. Our approach can also handle updates (i.e., inserts, deletes, and modifications) to the property graph if they do not incur any schema changes. If the accumulated updates change the data distributions, then we can apply the rules locally to the affected part of the ontology. Note that changes in data statistics can invalidate certain rules applied earlier, or can trigger new rules, especially inheritance and union rules. We can make local adjustments to accommodate these changes. Minimizing such transformation overheads is left as future work.
4.2.1. Concept-Centric Algorithm
As described in Section 2, an ontology describes a particular domain and provides a concept-centric view over domain-specific data. Intuitively, some concepts are more critical to the domain, and have more relationships with the other concepts (Abdul:2018, ). We expect these key concepts to be queried more frequently than other concepts.
To determine these key concepts, we utilize centrality analysis over the ontology to rank all concepts according to their respective centrality scores. The centrality analysis is based on the commonly used PageRank algorithm (Brin:1998, ), as its underlying assumption (more important websites are likely to receive more links from other websites) is similar to our intuition about key concepts. Our modified PageRank algorithm (Algorithm 6) determines the centrality score of each concept in an ontology. Compared to PageRank, we further introduce weights for both in- and out-degrees of concepts in determining their centrality scores.
Inheritance. To cater for inheritance relationships, we remove these relationships from the ontology while running the initial PageRank algorithm. This allows us to calculate the page rank of a concept based on the links from concepts that are not children of the same concept. After computing the page rank values of all concepts, we re-attach these relationships and update the page rank of each concept by doing a depth-first traversal over its inheritance relationships to find the parent with the highest page rank. If this value is higher than the current page rank of the concept, we use it as the new page rank of the concept. This enables a child concept to inherit the page rank of its parent. The intuition is that a child concept inherits all its other properties from the same chain of concepts and hence should have a similar estimate of centrality.
Unions. The union concept in the ontology represents a logical membership of two or more concepts. Any incoming edge to a union concept can therefore be considered as pointing to at least one of the member concepts of the union. Similarly each outgoing edge can be considered as emanating from at least one of the member concepts.
To handle union concepts, the algorithm iterates over all incoming and outgoing edges to/from the union concept. For each incoming edge to the union concept, we create new edges between the source concept and each of the member concepts of the union. For each outgoing edge, similarly, we create new edges between the destination and each of the member concepts of the union. Thus the page rank mass is appropriately distributed to/from the member nodes of the union. Finally, the union node itself is removed from the graph as its contribution towards centrality analysis has already been accounted for by the new edges to/from the member concepts of the union.
Out-degree of Concepts. In the default PageRank algorithm, the weight distribution of the page rank is proportional to the in-degree of a node, as it receives page rank values from all its neighbors that point to it. In other words, nodes with a high in-degree tend to have a higher page rank than nodes with a low in-degree. However, for a domain ontology, we observe that in-degree and out-degree are equally important for identifying key concepts. Hence, we introduce a reverse edge for every edge in the ontology, essentially making the graph equivalent to an undirected graph. Then, the algorithm uses this modified ontology as input to determine the centrality score of each concept.
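A condensed Python sketch of this modified PageRank follows. It is our own rendering, not the paper's Algorithm 6: union handling is omitted for brevity, the isA post-pass is a single pass rather than a depth-first traversal, and all names are assumptions:

```python
def centrality(concepts, edges, inheritance, damping=0.85, iters=50):
    """Modified PageRank over an ontology.
    `edges` holds the non-inheritance relationships as (src, dst) pairs
    (isA edges are removed before ranking, as described in the text);
    `inheritance` holds (child, parent) pairs."""
    n = len(concepts)
    # Treat the ontology as undirected by adding a reverse edge per edge.
    und = edges + [(d, s) for s, d in edges]
    out = {c: 0 for c in concepts}
    for s, _ in und:
        out[s] += 1
    pr = {c: 1.0 / n for c in concepts}
    for _ in range(iters):
        nxt = {c: (1 - damping) / n for c in concepts}
        for s, d in und:
            nxt[d] += damping * pr[s] / out[s]
        pr = nxt
    # Re-attach isA: a child takes its parent's rank when the parent ranks higher.
    for child, parent in inheritance:
        pr[child] = max(pr[child], pr[parent])
    return pr
```

On a small ontology where Drug is a hub, Drug (and any of its children via the isA pass) ends up with the highest score, matching the key-concept intuition.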
Using this modified PageRank algorithm, we associate a centrality score with each concept in the ontology. To accurately capture the relative importance of the concepts, we further leverage the data characteristics and access frequency information to rank all concepts. Namely, the ranking score for a concept c is defined as follows:

score(c) = PR(c) × AF(c) / size(c)    (2)

where PR(c) denotes the PageRank score of c, AF(c) denotes the access frequency of c (including accesses to all data properties of c), and size(c) denotes the size of c (including all data properties of c).
Based on Equation 2, our concept-centric algorithm (Algorithm 7) first sorts all concepts in descending order of their respective scores (Lines 1-2). Then, it iterates through the concepts (Lines 3-8). For each concept c, the algorithm applies all applicable rules (Section 3) to the relationships connecting to c. During this process, the algorithm updates the space limit as it is consumed by the rules. Once the space is fully exhausted, the algorithm terminates (Lines 7-8) and returns the optimized property graph schema (Line 10).
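The greedy skeleton can be sketched as follows. This is a simplification of Algorithm 7: the ranking score PR(c) × AF(c) / size(c) is our reading of the per-concept score built from PageRank, access frequency, and size, and the rule encoding is an assumption:

```python
def concept_centric(concepts, space_limit):
    """Greedy concept-centric schema optimization sketch.
    Each concept dict carries 'pr', 'af', 'size', and 'rules', where 'rules'
    is a list of (space_cost, rule_tag) pairs for the concept's relationships."""
    ranked = sorted(concepts, key=lambda c: c["pr"] * c["af"] / c["size"], reverse=True)
    applied = []
    for c in ranked:
        for cost, tag in c["rules"]:
            if cost > space_limit:
                return applied  # space budget exhausted: terminate
            space_limit -= cost
            applied.append(tag)
    return applied
```

Note that rule selection here is local to each concept: once a concept is reached, its rules are applied in order, which is exactly the limitation the relation-centric algorithm below addresses.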
4.2.2. Relation-Centric Algorithm
Intuitively, the concept-centric algorithm prioritizes the relationships of the key concepts in an ontology by leveraging information such as access frequency, data characteristics, and structural information from the ontology. However, the relationship selection is limited to each concept locally; the concept-centric algorithm does not have a globally optimal ordering among all relationships in the ontology. To address this issue, we propose the relation-centric algorithm, based on a cost-benefit model for each type of relationship, described as follows.
Cost-Benefit Models. The union rule, introduced in Section 3, connects each member concept directly to all concepts that are connected to the union concept. The benefit of applying this rule to a union relationship is the access frequency of that relationship, and the cost is the number of edges that we copy from the union concept to the member concept. Formally:
where the cost counts, for each neighborhood concept of the union, the number of edges between its instance vertices and those of the union concept (the neighborhood concepts do not include the member concepts of the union).
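Under these definitions, the union-rule scores can be sketched as below; the helper names and the `edge_counts` encoding are illustrative, not the paper's notation:

```python
def union_rule_benefit(freq):
    # Benefit: the access frequency of the union relationship.
    return freq

def union_rule_cost(edge_counts, members):
    """edge_counts: {neighbor_concept: number of instance edges between that
    neighbor and the union concept}. Per the definition above, edges to the
    union's own member concepts are excluded from the cost."""
    return sum(n for concept, n in edge_counts.items() if concept not in members)
```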
The benefit of applying the inheritance rule to an inheritance relationship is the access frequency of that relationship multiplied by the Jaccard similarity between the parent and child concepts. Depending on that similarity, the cost of the inheritance rule is either the number of new edges attached to the parent or the number of new edges attached to the child. Formally:
where the model uses the Jaccard similarity between the parent's and the child's property sets, the data type size of each propagated property (e.g., the size of INT, DOUBLE, STRING, etc.), the space overhead incurred by propagating the parent's (or the child's) properties to the other concept, and the space overhead incurred by connecting the neighbors of the removed concept to the remaining one.
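As an illustration of the inheritance model, the sketch below computes the Jaccard similarity between two property sets and uses it to pick which direction's cost applies; the function names and the thresholding are our simplification of Eq. 4, not its exact terms:

```python
def jaccard(props_a, props_b):
    """Jaccard similarity between two property sets."""
    a, b = set(props_a), set(props_b)
    return len(a & b) / len(a | b) if a | b else 1.0

def inheritance_cost(similarity, threshold,
                     cost_parent_to_child, cost_child_to_parent):
    # Low similarity: the parent's properties are copied down to the child,
    # so the cost is that of rewiring/propagating toward the child.
    # High similarity: the concepts are merged the other way instead.
    if similarity < threshold:
        return cost_parent_to_child
    return cost_child_to_parent
```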
Similarly, the cost-benefit model for the one-to-many rule, leveraging both data characteristics and access frequency information, is described as:
where the cost is the space overhead incurred by replicating the accessed data property, according to its data type size, to the other end of the relationship.
As described in Section 3, each M:N relationship is equivalent to two 1:M relationships. Thus, we first convert each M:N relationship in the ontology into two 1:M relationships, and then use Equation 5 to decide the cost-benefit for each of them. Potentially, some of the original M:N relationships could be optimized in only one direction. This increases the flexibility of applying the many-to-many rule, such that more frequently accessed data properties can be propagated to the other end of the relationship.
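This preprocessing step can be sketched as a simple rewrite over the relationship list; the tuple encoding `(src, dst, kind)` is our own:

```python
def split_many_to_many(relationships):
    """relationships: [(src, dst, kind)] with kind in {"1:M", "M:N", ...}.
    Each M:N relationship becomes two directed 1:M candidates, one per
    direction, so the selection phase may optimize either independently."""
    out = []
    for (src, dst, kind) in relationships:
        if kind == "M:N":
            out.append((src, dst, "1:M"))
            out.append((dst, src, "1:M"))
        else:
            out.append((src, dst, kind))
    return out
```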
With the cost and benefit scores, our goal is to select a subset of relationships in the ontology that maximizes the total benefit within the given space limit. We map our relationship selection problem to the 0/1 Knapsack Problem, which is NP-hard (Vazirani:2001, ).
Proposition 0 (Reduction).
If both benefit and cost of a relationship are positive, then every instance of the relationship selection problem can be reduced to a valid instance of the 0/1 Knapsack problem.
We adopt the fully polynomial time approximation scheme (FPTAS) (Vazirani:2001, ) for our relationship selection problem, which guarantees that the benefit of the optimized property graph schema is within a (1 - ε) bound (ε > 0) of the benefit of the optimal property graph schema.
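The selection step can be sketched with a standard dynamic-programming FPTAS for 0/1 knapsack; the scaling constant and state representation below are the textbook construction, not the paper's implementation (benefits and costs are assumed positive, per the proposition above):

```python
def knapsack_fptas(items, capacity, eps):
    """items: [(benefit, cost)]; returns the set of selected item indices.
    Benefits are scaled down by K = eps * max_benefit / n, so the DP over
    scaled-benefit states runs in polynomial time while guaranteeing at
    least (1 - eps) of the optimal benefit within the capacity."""
    n = len(items)
    bmax = max(b for b, _ in items)
    K = eps * bmax / n
    scaled = [max(1, int(b / K)) for b, _ in items]
    # dp: scaled benefit -> (minimum cost achieving it, chosen indices)
    dp = {0: (0.0, ())}
    for i, (_, cost) in enumerate(items):
        # Snapshot the states so item i is considered at most once.
        for v, (c, chosen) in list(dp.items()):
            nv, nc = v + scaled[i], c + cost
            if nc <= capacity and (nv not in dp or nc < dp[nv][0]):
                dp[nv] = (nc, chosen + (i,))
    best = max(dp)  # highest scaled benefit reachable within the limit
    return set(dp[best][1])
```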
Algorithm 8 takes as inputs an ontology and the space limit. Similar to Algorithm 5, it computes the Jaccard similarity scores for all inheritance relationships (Lines 1-2). Then it computes the cost and benefit of each relationship in the ontology using Equations 3, 4, and 5 (Lines 3-6). Next, the FPTAS algorithm selects a near-optimal subset of relationships within the given space limit (Line 7). The applyRules procedure then applies the corresponding rules (Lines 8-9). Lastly, an optimized property graph schema is generated (Lines 10-11).
5. Experimental Evaluation
In this section, we present experiments to evaluate the effectiveness of our property graph schema design algorithms, and compare the query performance of different property graphs generated by different algorithms.
5.1. Experimental Setup
Infrastructure. We implemented our approach in Java with JDK 1.8.0 running on Ubuntu 14.04 with a 16-core 3.4 GHz CPU and 128 GB of RAM. We choose two popular graph database systems, Neo4j (neo4j, ) and JanusGraph (janus, ), as our graph backends. We executed each experiment ten times and report the average.
Data sets. To evaluate the effectiveness of our system on different application domains, we use the following two data sets and their corresponding ontologies.
1. Financial data set (FIN) (blind:2018, ) includes data from two main sources: the Securities and Exchange Commission (SEC) (SEC, ) and the Federal Deposit Insurance Corporation (FDIC, ). The size of the data set is approximately 53 GB. The corresponding financial ontology contains 28 concepts, 96 properties, and 138 relationships. It covers financial entities, financial metrics, lender, borrower, and investment relationships, the officers of the companies as well as their relationships, etc.
2. Medical data set (MED) contains medical knowledge that is used to support evidence-based clinical decision and patient education. The total size of this data set is around 12 GB. The corresponding medical ontology consists of 43 concepts, 78 properties, and 58 relationships.
Methodology and metrics. To evaluate the quality of the property graph schemas produced by our algorithms, we vary the space limit and the Jaccard similarity thresholds for inheritance relationships under two different workload summaries (uniform and Zipf). Specifically, we show how effectively our approach leverages the given space limit, how robust it is to various workloads, and how sensitive it is to different similarity thresholds. Our schema generator chooses the property graph schema with the higher total benefit score between the relation-centric (RC) and concept-centric (CC) algorithms. We measure the quality of a property graph schema as the ratio of the total benefit score achieved by either the RC or the CC algorithm to the total benefit score of the property graph schema generated by Algorithm 5 without any space constraint.
To verify the graph query performance, we express most graph queries in both Cypher (DBLP:conf/sigmod/FrancisGGLLMPRS18, ) and Gremlin (gremlin, ), including path, reachability, and graph analytical queries. Among these query types, we construct a variety of query workloads conforming to different workload distributions over both financial and medical data sets. We use latency as the metric to measure these graph queries. Latency is measured in milliseconds as the total time of all queries in a workload executed in sequential order. Lastly, we also evaluate the efficiency of our concept-centric and relation-centric algorithms with different space constraints.
5.2. Property Graph Schema Quality
Varying Space Constraint. In Figures 9 and 9, we focus on the quality of the property graph schemas produced by our concept-centric (CC) and relation-centric (RC) algorithms compared to our method without space constraints (Algorithm 5). We choose two common workload summaries, uniform and Zipf distributions; the Zipf workload gives more accesses to the key concepts in the ontology. We first use Algorithm 5 to produce an optimal property graph schema without any space constraint, and then compute the total benefit score this schema achieves as well as the total amount of space it needs. We also compute the total amount of space needed by the direct mapping algorithm on the given ontology. We then vary the space constraint such that the range of the Y-axis in Figures 9 and 9 is from 0 to 1. Figures 9 and 9 show results for the MED and FIN data sets, respectively.
In Figure 9, we observe that RC consistently outperforms CC with both uniform and Zipf workloads. The reason is that RC has a global ordering of all relationships, and this ordering is near-optimal with respect to the given space constraint due to the adopted approximate Knapsack algorithm. On the contrary, CC suffers from a locally optimal ordering with respect to each concept; hence, it misses the opportunity to utilize the space for more beneficial relationships. Moreover, we observe that with approximately 20% of the maximum space constraint, both algorithms produce high-quality property graph schemas that achieve above 50% of the total benefit. In other words, both algorithms can effectively utilize rather limited space. Lastly, both CC and RC produce the same property graph schema as Algorithm 5 when the space constraint reaches 100%, which substantiates Theorem 1.
Similarly, RC outperforms CC in Figure 9, as CC utilizes the space for one concept at a time, missing opportunities for more beneficial relationships in the ontology. We also observe that both algorithms, with uniform and Zipf workloads, exhibit a couple of drops as the space constraint increases. The reason is primarily the complexity of the FIN ontology: given that inheritance relationships are more dominant in FIN, the given space may be exhausted quickly by certain inheritance relationships. Again, CC and RC produce the same property graph schema as Algorithm 5 at a 100% space constraint.
Varying Jaccard Similarity. In Figure 10, we show the sensitivity of both CC and RC with respect to the Jaccard similarity thresholds. In this experiment, we choose the FIN ontology because it contains multiple inheritance relationships. Uniform and Zipf workload distributions are used to examine the robustness of our CC and RC algorithms. Note that the space constraint in this experiment is set to 50% of the space consumed by the optimal schema under each specific Jaccard similarity threshold. The reason is that the cost (space overhead) of the same inheritance relationship can vary (Eq. 4) depending on the similarity threshold; consequently, the space consumption of the optimal property graph changes under different thresholds. As shown in Figure 10, both CC and RC are robust under different similarity thresholds. In the worst case, they achieve more than 70% of the maximum benefit score under the 50% space constraint.
In summary, CC and RC produce high-quality property graph schemas under various settings, and they work effectively with any given space constraint. Moreover, RC always produces a near-optimal property graph schema and outperforms CC in most cases. Our property graph schema generator nevertheless leverages both algorithms and chooses the property graph schema with the highest benefit score under any space constraint.
5.3. Graph Query Execution
In this section, we focus on the graph query execution performance over the property graphs created by our ontology-driven approach, using both the MED and FIN data sets. First, we create a micro benchmark to empirically examine whether the property graph schema from our approach actually benefits a set of graph primitives, including simple pattern matching, vertex property lookup, and aggregation on vertices. Second, we study the overall execution time for a given graph query workload that mixes the above graph primitives. We run the graph queries, expressed in Cypher and Gremlin, on Neo4j and JanusGraph, respectively. Note that our goal is not to compare the performance of the two systems, but rather to show that our schema optimization yields query performance improvements irrespective of the backend.
Microbenchmark Using Graph Primitives. With both MED and FIN data sets, we compare the query performance of the property graph created by the optimized graph schema (OPT) to the baseline property graph created by a direct mapping of the ontology (DIR). OPT is produced with the Jaccard similarity thresholds and a space constraint of 0.5. All queries are expressed against DIR, and the query rewriter transforms them into queries over the OPT schema using the mapping file. Due to space limits, we only show a few representative queries used in the microbenchmark below.
As shown in Figure 11, the results are unequivocal: the optimized schema has significant advantages over the direct mapping schema for all types of queries. The graph pattern matching queries report all matches of a sub-graph with 3 vertices and 2 edges in the property graph. Query execution with our approach is at least 2.4 times faster than with the direct mapping schema. The number of edge traversals on DIR is always 2, as the query is specified with 2 edges connecting 3 vertices. On the other hand, our property graph requires at most 1 edge traversal, as some of the neighbor vertices have already been merged with the starting vertices.
The next group are vertex property lookup queries. Two of them look up a property of a parent-concept vertex, starting from a vertex of a child concept. Another starts from a vertex and looks up a property of its neighbor vertex; OPT stores that property with the starting vertex and returns the result without any edge traversal. The last looks up a property of the starting vertex itself; in this case, OPT and DIR have identical query performance, as no edge traversal is required. In the other cases, OPT takes advantage of having the property of the parent (or neighbor) concept available at the starting vertex and returns the result without any edge traversals; in the worst case, such a query runs more than an order of magnitude slower on the DIR property graph than on OPT.
The final group are graph aggregation queries that involve a traversal from one vertex to another: they count the number of neighbors of the starting vertex. On average, query execution is an order of magnitude faster with OPT than with DIR. Again, the reason is that the aggregation over the neighbor vertices can be returned instantaneously from the starting vertex. These results suggest that the proposed ontology-driven approach brings significant benefits to a variety of graph queries.
Lastly, we observe that the performance gain on Neo4j is more substantial than on JanusGraph for several of the queries. This shows that disk-based graph systems (e.g., Neo4j) benefit much more from our techniques, as the optimized schema requires significantly less disk I/O: the graph system loads fewer vertices and edges into memory. We expect this benefit to become even greater as the size of the property graph increases.
Graph Query Workload Performance. To evaluate the runtime performance of the property graph schemas generated by our approach, we first generate two query workloads that follow a Zipf distribution in terms of the access frequency to the concepts in the ontology. Each workload consists of a mix of 15 queries of the three types described in Figure 11. Based on the workloads, two optimized schemas (one per data set) are produced with the same parameter settings as in the previous experiment. We compare our optimized schemas to the direct mapping schemas on both JanusGraph and Neo4j, using total query latency to measure performance on the property graphs corresponding to the different schemas.
Figure 12 shows the total query latency in log scale. The optimized schemas offer significant performance boosts to the graph query workloads on both JanusGraph and Neo4j. In Figure 12(a), we observe that the total query latency on the optimized schema is around 7 and 22 times lower than on the direct mapping one over MED and FIN, respectively. The winning margin is even bigger on Neo4j (Figure 12(b)): the total query latency on both optimized schemas is approximately 2 orders of magnitude lower than with the direct mapping. These results verify that the designed rules for the different types of relationships in the ontology are effective at reducing edge traversals and consequently improving graph query performance. Furthermore, they demonstrate that our approach can effectively utilize the given space constraint by leveraging information such as data distribution and workload summaries.
5.4. Efficiency of Property Graph Schema Algorithms
Finally, we study the execution time of our concept-centric and relation-centric algorithms (Table 2). First, we observe that both CC and RC produce an optimized property graph schema in less than one second under different space constraints (shown in Table 2 as percentages of the space consumed by Algorithm 5). The optimization time of both algorithms is negligible compared to an exhaustive search approach, which failed to produce an optimal schema for MED even after 3 hours. Second, neither algorithm is sensitive to the space constraint, since both have polynomial time complexity with respect to the number of concepts and relationships in the given ontology. Third, RC is consistently faster than CC, and the performance difference is more significant on FIN. This is because the cost of the ontologyPR procedure dominates in CC, and it usually takes more iterations to converge when the ontology (i.e., FIN) is more complex.
6. Related Work
Schema optimization for improving query performance has been studied in the database community for decades (Codd:1970, ; Finkelstein:1988, ; Zilio:2004, ; 7498239, ). In recent years, the emergence of many large-scale knowledge graphs has drawn renewed attention to schema optimization. In this section, we present important work in this field, highlighting the main differences from our approach.
Schema Optimization in RDBMS/NoSQL. Extensive work exists on the schema design problem in relational database systems (Finkelstein:1988, ; Agrawal:2000, ; Zilio:2004, ; Bruno:2005, ; Kimura:2010, ; Dash:2011, ). RDBMSs provide a clean separation between logical and physical schemas. The logical schema includes a set of table definitions and determines a physical schema consisting of a set of base tables (Finkelstein:1988, ; Agrawal:2000, ; Zilio:2004, ). The physical layout of these base tables is then optimized with auxiliary data structures such as indexes and materialized views for the expected workload (Agrawal:2000, ; Kimura:2010, ). Typically, the physical design involves identifying candidate physical structures and selecting a good subset of these candidates (Dash:2011, ). NoSE (7498239, ) recommends schemas for NoSQL applications; its cost-based approach uses a binary integer programming formulation to generate a schema based on the application's conceptual data model.
In principle, our approach is similar to logical schema design in RDBMSs, deferring the physical design to the underlying graph systems. On the other hand, we differ from the above methods because data modeling for graphs is very different from data modeling for relational systems. Specifically, the graph structure results in more expressive data models than those produced using relational databases. Moreover, our approach leverages an ontology with rich semantic information to drive the schema optimization, which none of the above work considers.
Schema Optimization in Knowledge Graphs. In the last few years, RDF has been growing significantly for expressing graph data. A variety of schemas have been proposed for physically storing graph data in both centralized and distributed settings (Huang2011ScalableSQ, ; DBLP:conf/www/MadukoASS07, ; DBLP:journals/vldb/NeumannW10, ; DBLP:conf/icde/NeumannM11, ; DBLP:conf/icde/MeimarisPMA17, ; DBLP:conf/sigmod/BorneaDKSDUB13, ; DBLP:conf/wise/HarrisS05, ; DBLP:journals/vldb/AbadiMMH09, ; DBLP:conf/vldb/ChongDES05, ). Some of these works focus on optimizing RDF data storage and SPARQL queries based on either workload statistics (DBLP:conf/www/MadukoASS07, ; DBLP:journals/vldb/NeumannW10, ; DBLP:conf/icde/NeumannM11, ; DBLP:conf/icde/MeimarisPMA17, ) or heuristics (DBLP:conf/edbt/TsialiamanisSFCB12, ). Other works (DBLP:conf/sigmod/BorneaDKSDUB13, ; DBLP:conf/wise/HarrisS05, ; DBLP:journals/vldb/AbadiMMH09, ; DBLP:conf/vldb/ChongDES05, ) attempt to transform RDF data into relational data and provide SPARQL views over relational schemas, leveraging the many years of experience in RDBMS schema optimization.
Recently, works such as (DBLP:conf/sigmod/SunFSKHX15, ; szarnyas2017incremental, ; DBLP:conf/edbt/HassanKJAS18, ) address a similar problem in the context of property graphs. GRFusion (DBLP:conf/edbt/HassanKJAS18, ) focuses on filling the gap between the relational and the graph models rather than optimizing the graph schema to achieve better query performance. Szárnyas et al. (szarnyas2017incremental, ) propose to use incremental view maintenance for property graph queries; however, their approach can only support a subset of property graph queries by using nested relational algebra. SQLGraph (DBLP:conf/sigmod/SunFSKHX15, ) introduces a physical schema design that combines relational storage for adjacency information with JSON storage for vertex and edge attributes. It also translates Gremlin queries into SQL queries in order to leverage relational query optimizers. However, SQLGraph likewise focuses on physical schema design, which targets only relational databases, and its query translator is limited to Gremlin queries with no side effects. Our ontology-driven approach is different for the following reasons. First, our approach produces a high-quality schema applicable to any graph system compatible with the property graph model and Gremlin or Cypher queries. Second, we exploit the rich semantic information in an ontology to guide the schema design. Last but not least, our approach can further leverage these techniques to decide how the property graph should be stored on different storage backends.
Fan et al. (DBLP:journals/tkde/FanWW16, ) propose to answer graph pattern queries using views. They assume that the views are given as inputs and choose a subset of views to answer a query. Hence the optimized schema generated from our approach can be considered as a view on the original property graph, which can be consumed by their technique.
To the best of our knowledge, our ontology-driven approach is the first to address the property graph schema optimization problem for domain-specific knowledge graphs. Our approach takes advantage of the rich semantic information in an ontology to drive the property graph schema optimization. The produced schemas achieve up to 3 orders of magnitude of graph query performance speed-up compared to a direct mapping approach on two real-world knowledge graphs.
-  Amazon Neptune. https://aws.amazon.com/neptune/, March 2020.
-  Azure Cosmos DB. https://azure.microsoft.com/en-us/services/cosmos-db/, March 2020.
-  Federal Deposit Insurance Corporation. https://www.fdic.gov/regulations/resources/call/index.html, March 2020.
-  Gremlin query language. https://tinkerpop.apache.org/gremlin.html, March 2020.
-  JanusGraph: distributed graph database. http://janusgraph.org/, March 2020.
-  The Neo4j graph platform. https://neo4j.com/, March 2020.
-  OWL 2 Web Ontology Language document overview. https://www.w3.org/TR/owl2-overview/, March 2020.
-  Securities and Exchange Commission. https://www.sec.gov/dera/data/financial-statement-data-sets.html, March 2020.
-  D. J. Abadi, A. Marcus, S. Madden, and K. Hollenbach. SW-Store: a vertically partitioned DBMS for semantic web data management. VLDB J., 18(2):385–406, 2009.
-  S. Agrawal, S. Chaudhuri, and V. R. Narasayya. Automated selection of materialized views and indexes in SQL databases. In VLDB, pages 496–505, 2000.
-  K. D. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor. Freebase: a collaboratively created graph database for structuring human knowledge. In ACM SIGMOD, pages 1247–1250, 2008.
-  M. A. Bornea, J. Dolby, A. Kementsietsidis, K. Srinivas, P. Dantressangle, O. Udrea, and B. Bhattacharjee. Building an efficient RDF store over a relational database. In ACM SIGMOD, pages 121–132, 2013.
-  S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. In WWW, pages 107–117, 1998.
-  N. Bruno and S. Chaudhuri. Automatic physical database tuning: A relaxation-based approach. In ACM SIGMOD, pages 227–238, 2005.
-  E. I. Chong, S. Das, G. Eadon, and J. Srinivasan. An efficient sql-based RDF querying scheme. In VLDB, pages 1216–1227, 2005.
-  V. Christophides, V. Efthymiou, and K. Stefanidis. Entity Resolution in the Web of Data. Synthesis Lectures on the Semantic Web: Theory and Technology. Morgan & Claypool Publishers, 2015.
-  E. F. Codd. A relational model of data for large shared data banks. Commun. ACM, 13(6):377–387, June 1970.
-  D. Dash, N. Polyzotis, and A. Ailamaki. CoPhy: A scalable, portable, and interactive index advisor for large workloads. PVLDB, 4(6):362–372, 2011.
-  A. Deutsch, Y. Xu, M. Wu, and V. Lee. TigerGraph: A native MPP graph database. CoRR, abs/1901.08248, 2019.
-  X. L. Dong and D. Srivastava. Big Data Integration. Synthesis Lectures on Data Management. Morgan & Claypool Publishers, 2015.
-  W. Fan, X. Wang, and Y. Wu. Answering pattern queries using views. IEEE Trans. Knowl. Data Eng., 28(2):326–341, 2016.
-  S. Finkelstein, M. Schkolnick, and P. Tiberio. Physical database design for relational databases. ACM Trans. Database Syst., 13(1):91–128, 1988.
-  N. Francis, A. Green, P. Guagliardo, et al. Cypher: An evolving query language for property graphs. In ACM SIGMOD, pages 1433–1445, 2018.
-  S. Harris and N. Shadbolt. SPARQL query processing with conventional relational database systems. In WISE, pages 235–244, 2005.
-  O. Hartig and J. Hidders. Defining schemas for property graphs by using the graphql schema definition language. In Proceedings of the 2Nd Joint International Workshop on Graph Data Management Experiences & Systems (GRADES) and Network Data Analytics (NDA), GRADES-NDA’19, pages 6:1–6:11, 2019.
-  M. S. Hassan, T. Kuznetsova, H. C. Jeong, W. G. Aref, and M. Sadoghi. Extending in-memory relational database engines with native graph support. In EDBT, pages 25–36, 2018.
-  J. Huang, D. J. Abadi, and K. Ren. Scalable SPARQL querying of large RDF graphs. PVLDB, 4:1123–1134, 2011.
-  H. Kimura, G. Huo, A. Rasin, S. Madden, and S. B. Zdonik. CORADD: Correlation aware database designer for materialized views and indexes. PVLDB, 3(1-2):1103–1113, 2010.
-  H. Kllapi, E. Sitaridi, M. M. Tsangaris, and Y. Ioannidis. Schedule optimization for data processing flows on the cloud. In ACM SIGMOD, pages 289–300, 2011.
-  J. Lehmann, R. Isele, M. Jakob, et al. DBpedia - A large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web, 2015.
-  J. Leskovec, A. Rajaraman, and J. D. Ullman. Mining of Massive Datasets. Cambridge University Press, New York, NY, USA, 2nd edition, 2014.
-  A. Maduko, K. Anyanwu, A. P. Sheth, and P. Schliekelman. Estimating the cardinality of RDF graph patterns. In WWW, pages 1233–1234, 2007.
-  M. Meimaris, G. Papastefanatos, N. Mamoulis, and I. Anagnostopoulos. Extended characteristic sets: Graph indexing for SPARQL query optimization. In ICDE, pages 497–508, 2017.
-  M. J. Mior, K. Salem, A. Aboulnaga, and R. Liu. NoSE: Schema design for NoSQL applications. In ICDE, pages 181–192, May 2016.
-  T. Neumann and G. Moerkotte. Characteristic sets: Accurate cardinality estimation for RDF queries with multiple joins. In ICDE, pages 984–994, 2011.
-  T. Neumann and G. Weikum. The RDF-3X engine for scalable management of RDF data. VLDB J., 19(1):91–113, 2010.
-  N. Noy, Y. Gao, A. Jain, A. Narayanan, A. Patterson, and J. Taylor. Industry-scale knowledge graphs: Lessons and challenges. Commun. ACM, 62(8):36–43, July 2019.
-  A. Quamar, F. Özcan, and K. Xirogiannopoulos. Discovery and creation of rich entities for knowledge bases. In ExploreDB, 2018.
-  I. Robinson, J. Webber, and E. Eifrem. Graph Databases. O’Reilly Media, Inc., 2013.
-  S. Sakr and G. Al-Naymat. Relational processing of RDF queries: a survey. SIGMOD Record, 38(4):23–28, 2009.
-  J. Sen, F. Ozcan, A. Quamar, G. Stager, A. R. Mittal, M. Jammi, C. Lei, D. Saha, and K. Sankaranarayanan. Natural language querying of complex business intelligence queries. In SIGMOD, pages 1997–2000, 2019.
-  F. M. Suchanek, G. Kasneci, and G. Weikum. YAGO: A large ontology from Wikipedia and WordNet. Semantic Web, 2008.
-  W. Sun, A. Fokoue, K. Srinivas, A. Kementsietsidis, G. Hu, and G. T. Xie. SQLGraph: An efficient relational-based property graph store. In ACM SIGMOD, pages 1887–1901, 2015.
-  G. Szárnyas. Incremental view maintenance for property graph queries. arXiv preprint arXiv:1712.04108, 2017.
-  P. Tsialiamanis, L. Sidirourgos, I. Fundulaki, V. Christophides, and P. A. Boncz. Heuristics-based query optimisation for SPARQL. In EDBT, pages 324–335, 2012.
-  O. van Rest, S. Hong, J. Kim, X. Meng, and H. Chafi. PGQL: a property graph query language. In Graph Data-management Experiences and Systems, page 7, 2016.
-  V. V. Vazirani. Approximation Algorithms. Springer-Verlag, Berlin, Heidelberg, 2001.
-  D. C. Zilio, J. Rao, S. Lightstone, et al. Db2 design advisor: Integrated automatic physical database design. In VLDB, pages 1087–1097, 2004.
Appendix A Proof Sketch of Theorem 1
Let the ontology given as input to Algorithm 5 yield the resulting ontology that is used in Line 18 to produce the output schema. Proving Theorem 1 is equivalent to proving that applying the rules to the relationships in any order will yield the same result. The theorem trivially holds when the ontology contains no relationships, and when it contains exactly one relationship (only one rule can be triggered).
Base case: two relationships, i.e., for any two relationships, applying the rules in any order yields the same result. Since we only have two relationships, either two different rules are triggered (if the relationships are of different types) or one rule is triggered twice (if they are of the same type). Therefore, we need to prove that applying each pair of rules in either order yields the same result, examining every possible scenario for each rule.
Specifically, we need to prove that the following pairs of rules are order-independent: (i) union rule and inheritance rule, (ii) inheritance rule and 1:M rule, (iii) union rule and 1:M rule, (iv) inheritance rule and M:N rule, (v) union rule and M:N rule, and (vi) 1:M rule and M:N rule.
(i) Union and Inheritance. To prove that the union and inheritance rules are order-independent, we examine all the cases in which these two rules may be triggered in the same graph, as shown in Figure 13(a), (b), (c). We assume that the Jaccard similarity between the two concepts connected with an inheritance relationship is less than the threshold (see Algorithm 2), so the inheritance rule is triggered and the properties of the parent concept are copied to the child concept. The following observations apply in a straightforward way to the case in which the Jaccard similarity is greater than the threshold as well. Figure 13 contains more than two relationships, but two relationships in each subfigure are sufficient to prove the case; the additional relationships shown are for illustration purposes only.
In the trivial case of Figure 13(a), the source and destination concepts of the union and inheritance relationships are not inter-connected. If we apply the union rule first, we end up with the left part of Figure 13(d), leaving the right part of Figure 13(a) unchanged; if we apply the inheritance rule first, we end up with the right part of Figure 13(d), leaving the left part of Figure 13(a) unchanged. In both cases, applying the second rule generates the graph of Figure 13(d).
The case shown in Figure 13(b) is more complex: the same concept corresponds to both a union concept and a child concept. Applying the union rule first, we remove the union concept and connect its member concepts to the parent concept through inheritance relationships. Note that those inheritance relationships carry the same Jaccard value as the original one connecting the child to the parent, which we have assumed to be less than the threshold. Then, the inheritance rule is triggered, removing the parent concept, copying its properties to its new children (the member concepts), and connecting them to the parent's neighbors, as shown in Figure 13(e). If we apply inheritance first instead of union, then we first remove the parent, copy its properties to the child (which is also the union concept), and connect the child to the parent's neighbors. Then, applying the union rule, we remove the union concept and connect its member concepts to those neighbors, again resulting in the graph of Figure 13(e). The same observations hold for the case in which the same concept corresponds to both a parent concept and a union concept.
In a similar way, we can show that union and inheritance rules are order-independent in the case of Figure 13(c), in which the same concept () corresponds to a member concept and a parent concept. If we apply the union rule first, we remove and connect the member concepts and to . Then, applying the inheritance rule, we remove , copy its properties to , and connect to , resulting in the graph of Figure 13(f). If we apply the inheritance rule first, we remove , copy its properties to , and connect to through a union relationship. Finally, we apply the union rule and remove , connecting to and , also resulting in the graph of Figure 13(f).
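The trivial case of Figure 13(a) can also be checked mechanically. The sketch below uses simplified, hypothetical rule semantics (the members of a removed union concept absorb its properties; a removed parent's properties are copied to its child) and hypothetical concept names A, B, U, C, P; it is an illustration of order-independence, not the paper's exact algorithms.

```python
from copy import deepcopy

# Toy schema: properties per concept, plus union / inheritance edges.
# Concept names and rule semantics are illustrative assumptions.
graph = {
    "props": {"A": {"a"}, "B": {"b"}, "U": {"u"},
              "C": {"c"}, "P": {"p"}},
    "union": {("A", "U"), ("B", "U")},   # A, B are members of union U
    "inherit": {("C", "P")},             # C is a child of parent P
}

def union_rule(g):
    """Remove each union concept; its members absorb its properties."""
    g = deepcopy(g)
    for member, u in sorted(g["union"]):
        g["props"][member] |= g["props"][u]
    for _, u in list(g["union"]):
        g["props"].pop(u, None)
    g["union"] = set()
    return g

def inheritance_rule(g):
    """Copy parent properties to children, then drop the parent
    (the Jaccard similarity is assumed below the threshold)."""
    g = deepcopy(g)
    for child, parent in sorted(g["inherit"]):
        g["props"][child] |= g["props"][parent]
    for _, parent in list(g["inherit"]):
        g["props"].pop(parent, None)
    g["inherit"] = set()
    return g

# Both orders yield the same compiled schema (Figure 13(a) case).
assert inheritance_rule(union_rule(graph)) == union_rule(inheritance_rule(graph))
```

Since the two relationships live in disjoint components, commutation is immediate; the interacting cases of Figures 13(b) and 13(c) require the case analysis given above.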
(ii) Inheritance and 1:. We follow a similar strategy to prove that the inheritance and 1: rules are order-independent, enumerating all possible cases in which those two rules may be triggered in the same graph, as shown in Figure 14(a), (b), (c), (d). This time, as in all the remaining cases () - (), the proof is simpler, since no alternative intermediate graph is involved, regardless of which rule is applied first. The only difference is in the set of properties attached to each concept. Again, we assume that the Jaccard similarity between the two concepts connected with an inheritance relationship is less than the threshold, so the inheritance rule is triggered and the properties of the parent concept are copied to the child concept.
We skip the trivial case in which the inheritance and 1: relationships are not related, and start with the case depicted in Figure 14(a), where the parent concept is also the source concept of a 1: relationship. If we apply inheritance first, then we copy the properties of to , remove and connect to through a 1: relationship. Then, we apply the 1: rule and copy ’s properties to , resulting in the graph of Figure 14(e). If we apply the 1: rule first, then we first copy the properties of to and then we apply inheritance to copy the properties of (also including the properties of ) to , remove and connect to through a 1: relationship, resulting again in the graph of Figure 14(e).
In the case of Figure 14(b), the parent concept () is now also the destination of a 1: relationship. If we apply inheritance first, then we copy the properties of to , remove and connect to through a 1: relationship. Then, we apply the 1: rule and copy ’s properties to , resulting in the graph of Figure 14(f). If we apply the 1: rule first, then we first copy the properties of to and then we apply inheritance to copy the properties of to , remove and connect to through a 1: relationship. Finally, we apply the 1: rule again (recall that Algorithm 5 iterates until convergence) and copy the properties of to , again resulting in the graph of Figure 14(f).
In Figure 14(c), is a child and a source concept of a 1: relationship. In short, if we apply inheritance first, we remove and copy its properties to , and then we apply the 1: rule and also copy the properties of to , resulting in Figure 14(g). If we apply the 1: rule first, we copy the properties of to and then we apply inheritance to copy the properties of to and remove , again resulting in Figure 14(g).
Finally, in Figure 14(d), is a child and a destination concept of a 1: relationship. If we apply inheritance first, we remove and copy its properties to , and then we apply the 1: rule and copy the properties of (including the properties of ) to , resulting in the graph of Figure 14(h). If we apply the 1: rule first, we copy the properties of to and then we apply inheritance to copy the properties of to and remove . Again, we need to trigger the 1: rule once more to copy the properties of , now also including the properties of , to and get the graph of Figure 14(h).
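The repeated triggering of the 1: rule until convergence (Algorithm 5) can be sketched as a fixpoint loop. Everything below is a hedged illustration: the concept names are made up, and properties are assumed to flow from the source to the destination of each 1: relationship.

```python
# Illustrative convergence loop: copy properties along 1: edges
# until no concept's property set changes.
def propagate_until_convergence(props, edges):
    props = {c: set(p) for c, p in props.items()}
    changed = True
    while changed:
        changed = False
        for src, dst in edges:
            missing = props[src] - props[dst]
            if missing:
                props[dst] |= missing
                changed = True
    return props

# Chain X -(1:)-> Y -(1:)-> Z: Z eventually receives X's properties,
# even though a single pass over the edges in (Y,Z), (X,Y) order would
# miss them -- this is why the rule must be triggered once more.
result = propagate_until_convergence(
    {"X": {"x"}, "Y": {"y"}, "Z": {"z"}},
    [("Y", "Z"), ("X", "Y")],
)
assert result["Z"] == {"x", "y", "z"}
```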
For the remaining pairs of rules () – (), we can follow the same strategy and prove that they are order-independent for all possible cases.
Induction hypothesis. Applying the rules in any order for any , where =, always results in the same .
Then, applying the rules in any order for any , such that = +1 and , will always result in the same , since there is only one additional relationship in compared to , and only one possible rule corresponding to this new relationship can be triggered. ∎
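The induction step can be illustrated with a small harness: when every pair of rules commutes, any permutation of the rule set produces the same result. The property-copying rules below are hypothetical stand-ins for the actual schema rules.

```python
from itertools import permutations

# Each rule merges one concept's properties into another; rules that
# read from distinct, unmodified concepts commute pairwise.
def make_copy_rule(src, dst):
    def rule(props):
        props = {c: set(p) for c, p in props.items()}
        props[dst] |= props[src]
        return props
    return rule

rules = [make_copy_rule("P", "C"), make_copy_rule("Q", "C")]
start = {"P": {"p"}, "Q": {"q"}, "C": {"c"}}

results = []
for order in permutations(rules):
    props = start
    for rule in order:
        props = rule(props)
    results.append(props)

# Every application order converges to the same schema.
assert all(r == results[0] for r in results)
```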
Appendix B 0/1 Knapsack Problem Reduction
Given an instance of the 0/1 Knapsack problem, our reduction produces the following instance of relationship selection: the cost of relationship is set to , and the benefit of relationship is set to as well. We set the space limit to . Clearly, this reduction runs in polynomial time.
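Since the specific constants are elided above, the sketch below assumes the natural identity-style mapping: each Knapsack item i with weight w_i and value v_i becomes a relationship with cost w_i and benefit v_i, and the space limit is the knapsack capacity W. This is an assumption for illustration, not a verbatim transcription of the reduction.

```python
# Hedged sketch of the reduction (assumed identity mapping):
# item i (weight w_i, value v_i) -> relationship with cost w_i, benefit v_i;
# the knapsack capacity W becomes the space limit of the selection instance.
def reduce_knapsack_to_relationship_selection(weights, values, capacity):
    relationships = [
        {"id": i, "cost": w, "benefit": v}
        for i, (w, v) in enumerate(zip(weights, values))
    ]
    return relationships, capacity

rels, limit = reduce_knapsack_to_relationship_selection([2, 3, 4], [3, 4, 5], 5)
# The mapping is one-to-one, so a subset of items fits in the knapsack
# iff the corresponding relationships fit within the space limit.
assert limit == 5 and rels[1] == {"id": 1, "cost": 3, "benefit": 4}
```

The reduction visits each item once, so it clearly runs in polynomial (here linear) time.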
If we start with a YES instance of 0/1 Knapsack, then we claim that the reduction produces a YES instance of relationship selection. Suppose there exists a subset for which is maximized and . Then selecting the relationships in has total benefit and weight no greater than , so the instance of relationship selection produced by the reduction is a YES instance.
If the reduction produces a YES instance of relationship selection, then we claim that () is a YES instance of 0/1 Knapsack. Let be the selected relationships, whose total benefit is and whose total cost is at most . In other words, we have and . We conclude that () is a YES instance of the Knapsack problem, as required. ∎