Hierarchical Characteristic Set Merging for Optimizing SPARQL Queries in Heterogeneous RDF

09/07/2018, by Marios Meimaris, et al.

Characteristic sets (CSs) organize RDF triples based on the set of properties that characterize their subject nodes. This concept has recently been used in indexing techniques, as it can capture the implicit schema of RDF data. While most CS-based approaches yield significant improvements in space and query performance, they fail to perform well in the presence of schema heterogeneity, i.e., when the number of CSs becomes very large, resulting in a highly partitioned data organization. In this paper, we address this problem by introducing a novel technique for merging CSs based on their hierarchical structure. Our technique employs a lattice to capture the hierarchical relationships between CSs, identifies dense CSs and merges dense CSs with their ancestors, thus reducing the number of CSs as well as the links between them. We implemented our algorithm on top of a relational backbone, where each merged CS is stored in a relational table, and we performed an extensive experimental study to evaluate the performance and impact of merging on the storage and querying of RDF datasets, indicating significant improvements.


1 Introduction

Recent works in the state of the art of RDF data management have shown that extraction and exploitation of the implicit schema of the data can be beneficial for both storage and SPARQL query performance [14, 13, 8, 9]. In order to organize on disk, index and query triples efficiently, these approaches rely heavily on two structural components of an RDF dataset, namely (i) the notion of characteristic sets (CSs), i.e., the distinct property sets that characterize subject nodes, and (ii) the join links between CSs. For the latter, in our previous work we introduced Extended Characteristic Sets (ECSs) [8], which are typed links between CSs that exist only when there are object-subject joins between their triples, and we showed how RDF data management can rely extensively on CSs and ECSs for both storage and indexing, yielding significant performance benefits in heavy SPARQL workloads. However, this approach fails to address schema heterogeneity in loosely-structured datasets, where the number of CSs and ECSs becomes large (e.g., Geonames contains 851 CSs and 12136 CS links) and the data distributions become skewed, imposing large overheads on extraction, storage and disk-based retrieval [14, 8].

In this paper, we exploit the hierarchical relationships between CSs, as captured by the subsumption of their respective property sets, in order to merge related CSs. We follow a relational implementation approach by storing all triples corresponding to a set of merged CSs in a separate relational table and by executing queries through a SPARQL-to-SQL translation. Although alternative storage technologies can be considered (key-value stores, graph stores, etc.), we have selected well-established relational database technology for the implementation of our approach, in order to take advantage of existing data indexing and query processing techniques that have been proven to scale to large and complex datasets. To this end, we present a novel system, named raxonDB, that exploits these hierarchies in order to merge hierarchically related CSs and decrease the number of CSs and the links between them, resulting in a more compact schema with better data distribution. The resulting system, built on top of PostgreSQL, provides significant improvements in both the storage and the query performance of RDF data.

In short, our contributions are as follows:

  • We introduce a novel CS merging algorithm that takes advantage of CS hierarchies,

  • we implement raxonDB, an RDF engine built on top of a relational backbone that takes advantage of this merging for both storage and query processing,

  • we perform an experimental evaluation that indicates significant performance improvements for various parameter configurations.

2 Related Work

RDF data management systems generally follow three storage schemes, namely triples tables, property tables, and vertical partitioning. A triples table has three columns, representing the subject, predicate and object (SPO) of an RDF triple. Systems in this category typically replicate the data in different sort orders in order to facilitate sort-merge joins. RDF-3X [11] and Hexastore [18] build tables on all six permutations of SPO. Built on a relational backbone, Virtuoso [4] uses a 4-column table for quads and a combination of full and partial indexes. These methods work well for queries with small numbers of joins, but they degrade with increasing dataset sizes, numbers of joins and unbound variables.

The property table approach places data in tables whose columns correspond to properties of the dataset, where each table identifies a specific resource type. Each row identifies a subject node and holds the value of each property. This technique has been implemented experimentally in Jena [19] and DB2RDF [3], and shows promising results when resource types and their properties are well-defined. However, it incurs extra space overhead for NULL values in the case of sparse properties [1]. It also raises performance issues when handling complex queries with many joins, as the amount of intermediate results increases [7].

Vertical partitioning segments data into two-column tables. Each table corresponds to a property, and each row to a subject node [1]. This provides great performance for queries with bound objects, but suffers when tables exhibit large variations in size [17]. TripleBit [20] broadly falls under vertical partitioning: the data is vertically partitioned into chunks per predicate. While this reduces replication, it suffers from the same problems as property tables, and it does not consider the inherent schema of the triples in order to speed up the evaluation of complex query patterns.

In distributed settings, a growing body of literature exists, with systems such as Sempala [15], H2RDF [12] and S2RDF [16]. However, these are based on parallelization of centralized indexing and query evaluation schemes.

For these reasons, the latest state-of-the-art approaches rely on implicit schema detection in order to derive a hidden schema from RDF data and index/store triples based on this schema. Furthermore, due to the tabular structure that tends to implicitly underlie RDF data, recent works have been implemented on relational backbones. In our previous work [8], we defined Extended Characteristic Sets (ECSs) as typed links between CSs, and we showed how ECSs can be used to index triples and greatly improve query performance. In [14], the authors identify and merge CSs, similarly to our approach, into what they call an emergent schema. However, their main focus is to extract a human-readable schema with appropriate relation labelling. They do not use the hierarchical information of CSs; rather, they use semantics to drive the merging process. In [13] it is shown how this emergent schema approach can assist query performance; however, the approach is limited by the constraints of human-readable schema discovery. In our work, query performance, indexing and storage optimization are the main aims of the merging process, and thus we are not concerned with providing human-readable schema information or any form of schema exploration. In [9], the authors use CSs and ECSs in order to assist cost estimation for federated queries, while in [5], the authors use CSs in order to provide better triple reordering plans. To the best of our knowledge, this is the first work to exploit hierarchical CS relations in order to merge CSs and improve query performance.

3 Hierarchical CS Merging

3.1 Preliminaries

The RDF model does not generally enforce structural rules in the representation of triples; within the same dataset there can be largely diverse sets of predicates emitted from nodes of the same semantic type [8, 14, 10]. Characteristic Sets (CSs) [10] capture this diversity by representing implied node types based on the set of properties the nodes emit. Formally, given a collection of triples T and a subject node s, the characteristic set of s is cs(s) = {p | ∃o : (s, p, o) ∈ T}.

The set of properties of a CS c_i is denoted with P(c_i). Furthermore, in a given dataset, each CS represents a set of records, each identified by a subject node and containing all of the values of the subject node (i.e., objects) for the predicates in P(c_i). We denote the set of all records of c_i as R(c_i), while c_i is represented by a relational table t(c_i) that is defined by these two elements, i.e., t(c_i) = (P(c_i), R(c_i)). The tuples in R(c_i) are of the form (s, o_1, ..., o_n), where s is the identifier column (e.g., URI) of a subject node and o_1, ..., o_n are the values, i.e., object nodes, of the properties in P(c_i) for s. In the context of this paper, for the sake of simplicity, the term Characteristic Set refers collectively to the properties and records of a CS, i.e., its relational table, rather than just the set of properties proposed in the original definition.
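To make the definition concrete, the following Python sketch (illustrative only; the helper names are not part of raxonDB) derives the characteristic sets and their record sets from a list of (subject, predicate, object) triples:

from collections import defaultdict

def characteristic_sets(triples):
    # Group properties and values by subject node.
    props = defaultdict(set)
    values = defaultdict(lambda: defaultdict(list))
    for s, p, o in triples:
        props[s].add(p)
        values[s][p].append(o)
    # cs(s) is the frozenset of properties of s; R(c) collects the records of
    # all subjects that share the same characteristic set c.
    records = defaultdict(list)
    for s, ps in props.items():
        c = frozenset(ps)
        records[c].append({"s": s, **{p: values[s][p] for p in c}})
    return records

triples = [("ex:alice", "foaf:name", "Alice"), ("ex:alice", "foaf:age", "30"),
           ("ex:bob", "foaf:name", "Bob")]
for c, recs in characteristic_sets(triples).items():
    print(sorted(c), len(recs))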

Within a given dataset, CSs often exhibit hierarchical relationships as a result of the overlaps in their comprising sets of properties. For example, consider two CSs c_1 and c_2 describing human beings, with P(c_1) = {foaf:name} and P(c_2) = {foaf:name, foaf:age}. It can be seen that P(c_1) ⊂ P(c_2), and thus c_1 is a parent of c_2. This relationship entails an overlap of the properties that define the CSs, and can be exploited in order to provide a means to merge common CSs based on the specialization or generalization of the node types they describe. In what follows, we formally define the notions of CS subsumption, hierarchy and ancestral sub-graphs.

Definition 1. (CS Subsumption). Given two CSs c_i and c_j and their property sets P(c_i) and P(c_j), we say that c_i subsumes c_j, denoted c_i ≻ c_j, when the property set of c_i is a proper subset of the property set of c_j, i.e., P(c_i) ⊂ P(c_j). This subsumption forms parent-child relationships between CSs. CS subsumption relationships can be seen in Figure 1(a) as directed edges between nodes. The set of all parent-child relationships defines a CS hierarchy, as defined in the following.

Definition 2. (CS Hierarchy and Inferred Hierarchy). CS subsumption creates a partial ordering that essentially defines a hierarchy such that, when c_i ≻ c_j, then c_i is a parent of c_j. Formally, a CS hierarchy is a graph lattice H = (V_H, E_H), where V_H is the set of CSs and E_H the set of parent-child edges. A directed edge (c_i, c_j) between two CS nodes exists in E_H when c_i ≻ c_j and there exists no other c_k such that c_i ≻ c_k and c_k ≻ c_j. An example CS hierarchy can be seen in Figure 1(a). Given a hierarchy H, we denote its hierarchical closure with H+, so that H+ extends H to contain inferred edges between hierarchically related nodes that are not consecutive, e.g., a node and its grandchildren. An example inferred hierarchy can be seen in Figure 1(c) for a sub-graph of the graph in Figure 1(a), with the inferred relationships shown as dashed lines. In the remainder of this paper, we refer to H+ as the inferred hierarchy of H.
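A minimal sketch of how the hierarchy lattice and its closure could be computed from the retrieved property sets, assuming each CS is represented by a frozenset of properties (function names are illustrative):

def cs_hierarchy(css):
    # Direct parent-child edges: the parent's property set is a proper subset
    # of the child's, and no other CS lies strictly between them.
    edges = set()
    for parent in css:
        for child in css:
            if parent < child and not any(parent < mid < child for mid in css):
                edges.add((parent, child))
    return edges

def inferred_hierarchy(css):
    # Hierarchical closure H+: an edge for every proper-subset pair.
    return {(p, c) for p in css for c in css if p < c}

css = [frozenset({"name"}), frozenset({"name", "age"}),
       frozenset({"name", "age", "email"})]
print(len(cs_hierarchy(css)), len(inferred_hierarchy(css)))  # 2 direct edges, 3 in H+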

Definition 3. (CS Ancestral Sub-graphs). Given an inferred hierarchy H+, a CS c_b and a set of CSs V_A containing c_b and a subset of its ancestors, then G_A = (V_A, E_A), with E_A ⊆ E_{H+}, is an ancestral sub-graph with c_b as the lowermost child when, for every c_i ∈ V_A with c_i ≠ c_b, it holds that c_i ≻ c_b. This means that any sub-graph of H+ with c_b as its only sink node is an ancestral sub-graph of c_b, and thus it holds that P(c_i) ⊂ P(c_b) for all c_i ∈ V_A \ {c_b}. For instance, in Figure 1(c), a dense node together with a subset of its (direct or inferred) ancestors forms an ancestral sub-graph with the dense node as the base CS.

Logically, we map each CS to a relational table; for a CS c_i we create a relational table t(c_i) with a column for the id of the subject and one column for each property in P(c_i), and we then use the CS hierarchy in order to merge the nodes of an ancestral sub-graph with base c_b into a single table. Specifically, we exploit the property set overlap in order to merge smaller parent CSs into larger child CSs, so as to minimize the effect of the NULL values that appear, for the records of the parents, in the columns of properties that exist only in the larger child CSs. Thus, c_b will be the most specialized CS in its ancestral sub-graph. For this reason, we define a merge operator, merge(G_A), as follows.

Definition 4. (Hierarchical CS Merge). Given an ancestral sub-graph G_A = (V_A, E_A) with base CS c_b as defined above, a hierarchical merge of G_A is given as merge(G_A) = (P_m, R_m), where P_m = P(c_b) and R_m = ∪_{c_i ∈ V_A} π_{P(c_b)}(R(c_i)). Here, P(c_b) is the most specialized property set in V_A, as c_b does not have any children in G_A, while R_m is the UNION of the records of all CSs in V_A, where π_{P(c_b)}(R(c_i)) is the projection of R(c_i) on P(c_b). This means that R_m will contain NULL values for all the non-shared properties of c_b and c_i, i.e., P(c_b) \ P(c_i). In essence, merge is an edge contraction operator that merges all nodes of an ancestral sub-graph into one, while removing the edges that connect them. For instance, assume that V_A = {c_1, c_2, c_3} is the set of vertices of an ancestral sub-graph with three CSs, with P(c_1) = {p_1}, P(c_2) = {p_1, p_2} and P(c_3) = {p_1, p_2, p_3}. Then merge(G_A) = (P(c_3), π_{P(c_3)}(R(c_1)) ∪ π_{P(c_3)}(R(c_2)) ∪ R(c_3)). Hierarchical merging can be seen in Figure 2.
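The merge operator of Definition 4 can be sketched as follows: every record of every CS in the ancestral sub-graph is projected onto the property set of the base CS, with NULL (None) for the non-shared properties (the record layout is illustrative):

def hierarchical_merge(base_props, cs_tables):
    # cs_tables: list of (property set, records) pairs for the CSs in V_A.
    # Returns the merged record set R_m over the columns P(c_b) = base_props.
    merged = []
    for props, records in cs_tables:
        for rec in records:
            projected = {"s": rec["s"]}
            for p in base_props:
                projected[p] = rec.get(p)  # None for properties not in P(c_i)
            merged.append(projected)
    return merged

# P(c1) = {p1}, P(c2) = {p1, p2}, P(c3) = {p1, p2, p3}; c3 is the base CS.
tables = [(frozenset({"p1"}), [{"s": "a", "p1": 1}]),
          (frozenset({"p1", "p2"}), [{"s": "b", "p1": 2, "p2": 3}]),
          (frozenset({"p1", "p2", "p3"}), [{"s": "c", "p1": 4, "p2": 5, "p3": 6}])]
print(hierarchical_merge(frozenset({"p1", "p2", "p3"}), tables))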

Definition 5. (Merge Graph). Given an inferred CS hierarchy H+, a merge graph G_M = (V_M, E_M) is a graph that consists of a set of ancestral sub-graphs and has the following properties: (i) V_M contains all nodes in H+, i.e., V_M = V_{H+}, so that it covers all CSs in the input dataset, (ii) E_M contains a subset of the edges in H+, i.e., E_M ⊆ E_{H+}, (iii) each node in V_M is contained in exactly one ancestral sub-graph, and (iv) all ancestral sub-graphs are pair-wise disconnected, i.e., there exist no edges between the nodes of different ancestral sub-graphs. Thus, each ancestral sub-graph can be contracted into one node unambiguously, using the merge operator. Also, the total number of relational tables will be equal to the number of ancestral sub-graphs in the merge graph.

Figure 1: (a) A CS hierarchy graph with dense nodes colored in deep purple, (b) the connected components derived by cutting off descendants from dense nodes, (c) a connected component with dashed lines representing inferred hierarchical relationships, (d) all possible assignments of dense nodes to non-dense nodes.

Problem Formulation. Given an inferred CS hierarchy H+, the problem is to find a merge graph G_M, in the form of a set of disconnected ancestral sub-graphs, that provides an optimal way to merge CS nodes. In other words, the problem is to find the set of ancestral sub-graphs of the inferred hierarchy that minimizes an objective cost function cost(G_M), or more formally:

G_M^opt = argmin_{G_M} cost(G_M)    (1)

This formulation entails several problems. First, the notion of cost depends on possibly subjective factors, such as the query workload, the storage technology, the input dataset and so on. There is no universal cost model that can be deployed in order to assess the effectiveness of a merge graph. Moreover, neither the number of ancestral sub-graphs nor the set of sub-graph roots is known as part of the input. A CS hierarchy of n nodes can potentially create an exponential number of sub-graphs, while the number of possible sets of sub-graph roots is also exponential with respect to the hierarchy size. Thus, given an arbitrary cost function, this is a problem of non-uniform graph partitioning on the inferred hierarchy, which is known to be NP-hard. That is, even with a deployed cost model, it is still an exponential problem to enumerate all possible sub-graphs and find the one with the minimum cost. For these reasons, we approach the problem by deploying a set of rules and heuristics that find a good merge graph efficiently and offer improved storage and query performance, as will be shown in the experiments.

3.2 CS Retrieval and Merging

The primary focus of this work is to improve the efficiency of the storage and query capabilities of relational RDF engines by exploiting the implicit schema of the data in the form of CSs. However, CS merging results in several problems that need to be addressed in this context. These are discussed in what follows.

First, the problem of selecting ancestral sub-graphs is a computationally hard one, as mentioned earlier. For this reason, we rely on a simple heuristic in order to seed the process and provide an initial set of ancestral sub-graph sink nodes that will form the bases of the final merged tables, as defined in Definition 3. Specifically, we identify dense CS nodes in the hierarchy (i.e., CSs with large cardinalities) and use these nodes as the bases of the ancestral sub-graphs. While node density can be defined in many different ways, in the context of this work we define a CS c_i to be dense if its cardinality is larger than a linear function of the maximum cardinality of the CSs in H, i.e., if |R(c_i)| > m × max_{c_j ∈ V_H} |R(c_j)|, with m ∈ [0, 1]. Here, m is called the density factor, and max_{c_j ∈ V_H} |R(c_j)| is the cardinality of the largest CS in H. This means that, by definition, if m = 0, no CSs will be merged, because all CSs will be considered dense and thus each CS will define its own ancestral sub-graph, while if m = 1, no ancestral sub-graphs will be defined and all CSs will be merged into one large table, as no CS has a cardinality larger than that of the largest CS. With a given m, the problem is reduced to finding the optimal ancestral sub-graph for each given dense node.
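A sketch of this density test, assuming the cardinality |R(c)| of each CS is known (names are illustrative):

def dense_css(cardinalities, m):
    # A CS is dense if its cardinality exceeds m times the largest CS cardinality.
    threshold = m * max(cardinalities.values())
    return {c for c, card in cardinalities.items() if card > threshold}

cards = {"c1": 10, "c2": 500, "c3": 10000}
print(dense_css(cards, 0.0))   # all CSs are dense -> no merging
print(dense_css(cards, 0.25))  # only c3 is dense
print(dense_css(cards, 1.0))   # no dense CSs -> everything ends up in one table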

Second, merging tables results in the introduction of NULL values for the non-shared columns, which can degrade performance. Specifically, merging CSs with different property sets can result in large numbers of NULL values in the resulting table. Given a parent CS c_p and a child CS c_c with |R(c_p)| ≫ |R(c_c)| and |P(c_c)| ≫ |P(c_p)|, the number of NULL cells will be significantly large compared to the total number of records, thus potentially causing poor storage and querying performance [14]. For this reason, CS merging must be performed in a way that minimizes the presence of NULL values. The following function captures the NULL-value effect of merging two CSs c_p and c_c with c_p ≻ c_c:

cost(c_p, c_c) = ( |P(c_c) \ P(c_p)| × |R(c_p)| ) / |R(c_c)|    (2)

Intuitively, cost(c_p, c_c) represents the ratio of NULL values to the cardinality of the base CS in the merge. The numerator of the fraction represents the total number of cell values that will be NULL, as the product of the number of non-shared properties and the cardinality of the parent CS. The denominator is the cardinality of the base CS. Hence, the base CS must be a descendant (i.e., the CS with more properties) in order to minimize the presence of NULLs.

In order to assess an ancestral sub-graph as a whole, we use a generalized version of cost(c_p, c_c) that captures the NULL-value effect on the entire sub-graph:

cost(G_i) = Σ_{c_j ∈ V_i \ {c_d}} cost(c_j, c_d)    (3)

Here, c_d is the dense root of sub-graph G_i and V_i is its set of nodes. However, merging a parent into a dense child changes the structure of the input graph, as the cardinality of the dense node is increased. To accommodate this, we define a cost function that works on the graph level, as follows:

cost(G_M) = Σ_{i=1}^{k} cost(G_{d_i})    (4)

where k is the number of dense nodes, d_i is a dense node and G_{d_i} is the ancestral sub-graph with d_i as the base node.
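As a sketch, Equations 2-4 translate into the following Python functions, where each CS is represented by its property set and cardinality (the data layout is illustrative, not raxonDB's actual code):

def merge_cost(parent, base):
    # Eq. 2: NULL cells introduced by merging a parent into the base,
    # relative to the base cardinality.
    non_shared = len(base["props"] - parent["props"])
    return non_shared * parent["card"] / base["card"]

def subgraph_cost(base, parents):
    # Eq. 3: NULL-value effect of an ancestral sub-graph with dense root `base`.
    return sum(merge_cost(p, base) for p in parents)

def merge_graph_cost(subgraphs):
    # Eq. 4: total cost of a merge graph given as (dense base, parents) pairs.
    return sum(subgraph_cost(base, parents) for base, parents in subgraphs)

c1 = {"props": frozenset({"p1"}), "card": 100}
c3 = {"props": frozenset({"p1", "p2", "p3"}), "card": 10000}
print(merge_cost(c1, c3))  # (2 * 100) / 10000 = 0.02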

Given this cost model and a pre-defined set of dense nodes, our exact algorithm finds the optimal sub-graph for each dense node. An inferred hierarchy graph can be converted to a set of connected components by removing the outgoing edges of dense nodes, since we are not interested in merging children into parents, but only parents into children. An example of this can be seen in Figure 1(b). We can then compute cost(G_M) as the sum of the costs of these components. The main idea is to identify all connected components in the CS graph, iterate through these components, enumerate all sub-graphs within each component that end in the given set of dense nodes, and select the optimal partitioning for each component.

The algorithm can be seen in Algorithm 1. It works by first identifying all connected components of the inferred hierarchy (Line 2). Identifying connected components is trivially done using standard DFS traversal and is not shown in the algorithm. Then, we iterate over the components (Line 3), and for each component we generate all possible sub-graphs. We calculate the cost of each sub-graph (Line 7) and, if it is smaller than the current minimum, the minimum cost and the best sub-graph are updated (Lines 8-9). Finally, we add the best sub-graph to the final list (Line 11) and move on to the next component.

Figure 2: Merging the tables of c_1, c_2 and c_3.

To generate the sub-graphs, we do not need an exhaustive generation of combinations; instead, we rely on the observation that each non-dense node must be merged into exactly one dense node. Therefore, sub-graph generation is reduced to finding all possible assignments of dense nodes to the non-dense nodes. An example of this can be seen in Figure 1(d), where all possible and meaningful sub-graphs are enumerated in the table at the right of the figure by assigning a dense node to each of the non-dense nodes. An assignment is only possible if there exists a parent-child relationship between the non-dense node and the dense node, even if it is an inferred one (e.g., an inferred parent such as a grandparent). Hence, the problem of sub-graph generation becomes one of generating combinations from different lists by selecting one element from each list. The number of lists is equal to the number of non-dense nodes, and the elements of each list are the dense nodes that are related to the non-dense node.
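This enumeration can be sketched as a Cartesian product over these lists (illustrative Python, not the actual implementation):

from itertools import product

def enumerate_assignments(dense_candidates):
    # dense_candidates maps each non-dense node to the dense nodes it may be
    # merged into (direct or inferred descendants). Each combination is one
    # candidate merge graph: a mapping non-dense node -> dense node.
    nodes = list(dense_candidates)
    for combo in product(*(dense_candidates[n] for n in nodes)):
        yield dict(zip(nodes, combo))

candidates = {"c1": ["d1", "d2"], "c2": ["d1", "d3"]}
print(list(enumerate_assignments(candidates)))  # 2 * 2 = 4 candidate assignments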

Complexity Analysis. Assume that a connected component has n non-dense nodes and d dense nodes, and that each non-dense node i is related to d_i ≤ d dense nodes; then the number of sub-graphs that need to be enumerated is the product of the d_i over all non-dense nodes. In the example of Figure 1(d), this is the product of the sizes of the candidate lists of the non-dense nodes. In the worst case, all n non-dense nodes are parents of all d dense nodes, and the total number of sub-graphs is d^n, which makes the asymptotic time complexity of the algorithm O(d^n).

Data: An inferred hierarchy lattice H+ given as an adjacency list, and a set of dense CSs D
Result: A set of optimal ancestral sub-graphs G_list
1  init G_list ← ∅;
2  components ← connectedComponents(H+, D);
3  for each comp ∈ components do
4        init minCost ← ∞;
5        init bestGraph ← ∅;
6        while comp has a next candidate ancestral sub-graph G do
7              if cost(G) < minCost then
8                    minCost ← cost(G);
9                    bestGraph ← G;
10       end while
11       G_list ← G_list ∪ {bestGraph};
12 end for
13 return G_list;
Algorithm 1 optimalMerge

3.3 Greedy Approximation

For very small values of n (e.g., in small connected components), the asymptotic complexity of optimalMerge is acceptable. However, in real-world cases the number of connected components can be small, which makes the components, and thus the exponent n, large. For this reason, we introduce a heuristic algorithm that approximates the solution without enumerating all possible combinations; instead, it relies on a greedy objective function that attempts to find, for each non-dense node, the local minimum with respect to our cost model. Note that it lies beyond the scope of this work to compute the degree of approximation to the optimal solution; however, in our experiments, the heuristic solution is shown to provide significant performance gains.

The main idea behind the algorithm is to iterate over the non-dense nodes and, for each non-dense node, compute the cost function and find the dense node that minimizes it. Then, the cardinalities are recomputed and the next non-dense node is examined. The algorithm can be seen in Algorithm 2. In the beginning, the algorithm initializes a hash table M with an empty list for each dense node (Lines 1-4). Then, it iterates over all non-dense nodes (Line 5) and, for each non-dense node, calculates the cost of merging it into each of its related dense nodes (Lines 5-13), keeping the current minimum cost and dense node. In the end, the current non-dense node is added to the list of the dense node that minimizes the cost (Line 14). Notice that we do not need to split the hierarchy into connected components in order for greedyMerge to work.

Complexity Analysis. Given n non-dense nodes and d dense nodes, where each non-dense node i is related to d_i ≤ d dense nodes, the algorithm needs Σ_i d_i iterations, because we need to iterate over the related dense nodes of each non-dense node. In the worst case, every non-dense node is related to all dense nodes, requiring n × d iterations. Assuming a constant cost for the computation of cost(c_p, c_c), the asymptotic complexity of the greedy algorithm is O(n × d), which is a significant improvement over the exponential complexity of optimalMerge.

Data: A hash table desc mapping each non-dense CS to its dense descendants, a set of dense CSs D, and a set of non-dense CSs N
Result: A hash table M mapping dense CSs to the sets of non-dense CSs to be merged into them
1  init M;
2  for each d ∈ D do
3        M[d] ← ∅;
4  end for
5  for each c ∈ N do
6        init minCost ← ∞;
7        init bestDense ← null;
8        for each d ∈ desc[c] do
9              curCost ← cost(c, d);
10             if curCost < minCost then
11                   minCost ← curCost;
12                   bestDense ← d;
13       end for
14       add c to M[bestDense] and update the cardinality of bestDense;
15 end for
16 return M;
Algorithm 2 greedyMerge
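A compact Python sketch of the same greedy strategy, using the cost of Eq. 2 and illustrative data structures (not raxonDB's actual code):

def greedy_merge(dense_descendants, stats):
    # dense_descendants: non-dense CS -> list of its dense descendants (assumed
    # non-empty for every non-dense CS in this sketch).
    # stats: CS -> {"props": frozenset, "card": int}.
    # Returns a mapping dense CS -> list of non-dense CSs merged into it.
    assignment = {}
    for c, dense_nodes in dense_descendants.items():
        best, best_cost = None, float("inf")
        for d in dense_nodes:
            non_shared = len(stats[d]["props"] - stats[c]["props"])
            cost = non_shared * stats[c]["card"] / stats[d]["card"]
            if cost < best_cost:
                best, best_cost = d, cost
        assignment.setdefault(best, []).append(c)
        stats[best]["card"] += stats[c]["card"]  # recompute the dense node's cardinality
    return assignment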

Obviously, this process does not necessarily cover all CSs of the input dataset. The percentage of the dataset that is covered by this process is called dense CS coverage. The CSs that are not contained in any merge path are aggregated into one large table containing all of their predicates. If the total coverage of the merging process is large, this table does not impose a heavy overhead on query performance, as will be shown in the experiments. Finally, we load the data into the corresponding tables.

Figure 3: An example of greedy merging. Dense nodes are coloured in deep purple. At each step, the non-dense node under examination is coloured green, while the edge that minimizes the merge cost is shown in bold.

3.4 Implementation

We implemented raxonDB as a storage and querying engine that supports hierarchical CS merging, and can be deployed on top of standard RDBMS solutions. Specifically, we used PostgreSQL 9.6, but raxonDB can be adapted for other relational databases as well. The architecture of raxonDB can be seen in Figure 4.

CS Retrieval and Merging. The processes of retrieving and merging CSs take place during the loading stage of an incoming RDF dataset. CS retrieval is a straightforward procedure that requires scanning the whole dataset and storing the unique sets of properties emitted from the subject nodes of the incoming triples; it is adopted from our previous work in [8], where it is described in detail. After retrieving the CSs, the main idea is to compute the inferred CS hierarchy and apply one of the described merging algorithms. Finally, each set of merged CSs is stored in a relational table. In each table, the first column represents the subject identifier, while the rest of the columns represent the union of the property sets of the merged CSs. For multi-valued properties, we use PostgreSQL's array data type in order to avoid duplicating rows.

Indexing. We deploy several indexes in raxonDB. First, we index the subject id of each row. We also build foreign-key indexes on object-subject links between rows in different CSs, i.e., when a value of a property in one CS is the subject id of another CS. Next, we use standard B+tree indexes for single-valued property columns, while we use PostgreSQL's GIN indexes, which apply to array datatypes, for indexing multi-valued properties. This enables fast access for CS chain queries, i.e., queries that apply successive joins over object-subject relationships. Furthermore, we store these links at the schema level as well, i.e., we keep an index of CS pairs that are linked by at least one object-subject pair of records. These links are called Extended Characteristic Sets (ECSs) and are based on our previous work in [8]. With the ECS index, we can quickly filter out CSs that are guaranteed not to be related, i.e., no joins exist between them, even if they are individually matched in a chain of query CSs. Other metadata and indexes include the property sets of CSs and which properties can contain multiple values in each CS.
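For illustration, the following sketch prints the kind of PostgreSQL DDL that a merged CS table with its B+tree and GIN indexes could correspond to; the table and column names are hypothetical, and raxonDB's actual schema may differ:

merged_cs = {
    "table": "cs_merged_3",
    "single_valued": ["foaf_name", "foaf_age"],
    "multi_valued": ["foaf_knows"],  # multi-valued properties become array columns
}

cols = ["s_id BIGINT PRIMARY KEY"]
cols += [f"{c} TEXT" for c in merged_cs["single_valued"]]
cols += [f"{c} TEXT[]" for c in merged_cs["multi_valued"]]

statements = [
    f"CREATE TABLE {merged_cs['table']} ({', '.join(cols)});",
    # B+tree index on a single-valued property column
    f"CREATE INDEX ON {merged_cs['table']} (foaf_age);",
    # GIN index on the array-typed (multi-valued) property column
    f"CREATE INDEX ON {merged_cs['table']} USING GIN (foaf_knows);",
]
print("\n".join(statements))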

Figure 4: Architecture of raxonDB.

Query Processing. Processing SPARQL queries on top of merged CSs entails (i) parsing the queries, (ii) retrieving the query CSs, (iii) identifying the joins between them, and (iv) mapping them to merged tables in the database. Steps (i)-(iii) are inherited from our previous work in [8]. For (iv), a query CS can match more than one table in the database. For instance, consider a query containing a chain of three CSs, c_1, c_2 and c_3, joined sequentially with object-subject joins. Each query CS matches all tables whose property sets are supersets of its own property set. Thus, each join in the initial query creates a set of permutations of table joins that need to be evaluated. For instance, assume that c_1 matches table t_1, while c_2 matches t_2 and t_3, and c_3 matches t_4. Furthermore, by looking up the ECS index, we derive that the links t_1-t_2, t_1-t_3, t_2-t_4 and t_3-t_4 are all valid, i.e., they correspond to candidate joins in the data. Then, t_1-t_2-t_4 and t_1-t_3-t_4 are both valid table permutations that must be processed. Two strategies can be employed here. The first is to join the UNIONs of the matching tables for each query CS, and the other is to process each permutation of tables separately and append the results. Given the filtering performed by the ECS indexing approach, where we can pre-filter CSs based on the relationships between them, the UNION would impose significant overhead and eliminate the advantage of ECS indexing. Therefore, we have implemented the second approach, that is, we process a separate query for each permutation. Finally, due to the existence of NULL values in the merged tables, we must add explicit IS NOT NULL restrictions for all the properties that are contained in each matched CS and are not part of any other restriction or filter in the original query.
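The permutation handling can be sketched as follows: the candidate tables of each query CS are combined, and the ECS index prunes combinations whose consecutive tables have no recorded link (an illustrative sketch, not the actual implementation):

from itertools import product

def table_permutations(candidates, ecs_links):
    # candidates: one list of matching merged tables per query CS in the chain.
    # ecs_links: set of (table_a, table_b) pairs with at least one object-subject join.
    perms = []
    for combo in product(*candidates):
        if all((a, b) in ecs_links for a, b in zip(combo, combo[1:])):
            perms.append(combo)
    return perms

candidates = [["t1"], ["t2", "t3"], ["t4"]]
ecs_links = {("t1", "t2"), ("t1", "t3"), ("t2", "t4"), ("t3", "t4")}
print(table_permutations(candidates, ecs_links))
# [('t1', 't2', 't4'), ('t1', 't3', 't4')]; removing ('t3', 't4') from the ECS
# index would prune the second permutation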

4 Experimental Evaluation

We implemented raxonDB on top of PostgreSQL (the code and queries are available at https://github.com/mmeimaris/raxonDB). We did not extend our previous native RDF implementation of axonDB [8] because, given the underlying relational schema of the CS tables, we decided to rely on a well-established relational engine for both the planning and the execution of queries, instead of re-implementing them. As the focus of this paper is to improve RDF storage and querying efficiency in relational settings, we rely on existing mechanisms within PostgreSQL for I/O operations, physical storage and query planning. In this set of experiments, we report results obtained with the greedy approximation algorithm, as the optimal algorithm failed to finish the merging process even on datasets with small numbers of CSs.

Datasets. For this set of experiments, we used two synthetic datasets, namely LUBM2000 (300m triples) and WatDiv (100m triples), as well as two real-world datasets, namely Geonames (170m triples) and Reactome (15m triples). LUBM [6] is a customizable generator of synthetic data that describes academic information about universities, departments, faculty, and so on. Similarly, WatDiv [2] is a customizable generator with more options for the production and distribution of triples to classes. Reactome (http://www.ebi.ac.uk/rdf/services/reactome) is a biological dataset that describes biological pathways, and Geonames (http://www.geonames.org/ontology/documentation.html) is a widely used ontology of geographical entities with varying properties. Geonames maintains a rich graph structure, as there is heavy usage of hierarchical area features on a multitude of levels.

Loading. In order to assess the effect of hierarchical merging on the loading phase, we performed a series of experiments using all four datasets. For this experiment, we measure the size on disk, the loading time, the final number of merged tables, the number of ECSs (joins between merged tables) and the percentage of triples covered by CSs included in the merging process, for varying values of the density factor m. The results are summarized in Table 1. As can be seen, the number of CSs, and consequently of tables, is greatly reduced with increasing values of m. As the number of CSs is reduced, the expected number of joins between CSs is also reduced, which can be seen in the column that measures ECSs. Consequently, the number of tables can be decreased significantly without trading off large amounts of dense CS coverage, i.e., without creating large tables with many NULL values. Loading time tends to be slightly larger as the number of CSs decreases and the number of merges increases, the only exception being WatDiv, where loading time actually decreases. This is a side-effect of the excessive number of tables (5667) in the simple case, which imposes large overheads for persisting the tables on disk and generating indexes and statistics for each one.

(a) Execution time (seconds) for LUBM
(b) Execution time (seconds) for Geonames
(c) Execution time (seconds) for Reactome
Figure 5: Query execution times in milliseconds
(a) # of CS permutations for LUBM
(b) # of CS permutations for Geonames
(c) # of CS permutations for Reactome
Figure 6: # of CS permutations for increasing m
Dataset | Size (MB) | Time | # Tables (CSs) | # of ECSs | Dense CS Coverage
Reactome Simple | 781 | 3min | 112 | 346 | 100%
Reactome (m=0.05) | 675 | 4min | 35 | 252 | 97%
Reactome (m=0.25) | 865 | 4min | 14 | 73 | 77%
Geonames Simple | 4991 | 69min | 851 | 12136 | 100%
Geonames (m=0.0025) | 4999 | 70min | 82 | 2455 | 97%
Geonames (m=0.05) | 5093 | 91min | 19 | 76 | 87%
Geonames (m=0.1) | 5104 | 92min | 6 | 28 | 83%
LUBM Simple | 591 | 3min | 14 | 68 | 100%
LUBM (m=0.25) | 610 | 3min | 6 | 21 | 90%
LUBM (m=0.5) | 620 | 3min | 3 | 6 | 58%
WatDiv Simple | 4910 | 97min | 5667 | 802 | 100%
WatDiv (m=0.01) | 5094 | 75min | 67 | 99 | 77%
WatDiv (m=0.1) | 5250 | 75min | 25 | 23 | 63%
WatDiv (m=0.5) | 5250 | 77min | 16 | 19 | 55%
Table 1: Loading experiments for all datasets

Query Performance. In order to assess the effect of the density factor during query processing, we performed a series of experiments on LUBM, Reactome and Geonames. For the workload, we used the sets of queries from [8]. We employ two metrics, namely execution time and number of table permutations. The results can be seen in Figures 5 and 6. As can be seen, hierarchical CS merging can speed up query performance significantly as long as the dense coverage remains high. For example, in all datasets query performance degrades dramatically when m = 1, in which case the merging process cannot find any dense CSs; all rows are then added to one large table, and the database consists of a single table with many NULL cells. These findings are consistent across all three datasets, and further future work is required in order to identify the optimal value for m.

In order to assess the performance of raxonDB and establish that no overhead is imposed by the relational backbone, we performed a series of queries on LUBM2000, Geonames and Reactome, employing the best-performing CS merging with respect to our previous findings. We also compared query performance with RDF-3X, Virtuoso 7.1, TripleBit and the emergent schema approach described in [13]. The results can be seen in Figure 7 and indicate that raxonDB provides equal or better performance than the original axonDB implementation, as well as the rest of the systems, including the emergent schema approach, which is the only direct competitor for merging CSs. Especially for queries with large intermediate results and low selectivity that correspond to a few CSs and ECSs (e.g., LUBM Q5 and Q6, Geonames Q5 and Q6), several of the other approaches fail to answer quickly and in some cases time out.

(a) Execution time (seconds) for LUBM2000
(b) Execution time (seconds) for Geonames
(c) Execution time (seconds) for Reactome
Figure 7: Query execution times in milliseconds for different RDF engines

5 Conclusions and Future Work

In this paper, we tackled the problem of merging characteristic sets based on their hierarchical relationships. As future work, we will study the computation of the optimal value for the density factor m, taking into consideration workload characteristics as well as a more refined cost model for the ancestral paths. Furthermore, we will study the application of these findings in a distributed architecture, in order to further scale the capabilities of raxonDB.

References

  • [1] D. J. Abadi, A. Marcus, S. R. Madden, and K. Hollenbach. Scalable semantic web data management using vertical partitioning. In VLDB, 2007.
  • [2] G. Aluç, O. Hartig, M. T. Özsu, and K. Daudjee. Diversified stress testing of RDF data management systems. In ISWC, 2014.
  • [3] M. A. Bornea, J. Dolby, A. Kementsietsidis, K. Srinivas, P. Dantressangle, O. Udrea, and B. Bhattacharjee. Building an efficient RDF store over a relational database. In ACM SIGMOD, 2013.
  • [4] O. Erling and I. Mikhailov. Virtuoso: RDF support in a native RDBMS. Springer, 2010.
  • [5] A. Gubichev and T. Neumann. Exploiting the query structure for efficient join ordering in SPARQL queries. In EDBT, 2014.
  • [6] Y. Guo, Z. Pan, and J. Heflin. LUBM: A benchmark for OWL knowledge base systems. Web Semantics: Science, Services and Agents on the World Wide Web, 3(2):158–182, 2005.
  • [7] M. Janik and K. Kochut. BRAHMS: A workbench RDF store and high performance memory system for semantic association discovery. In ISWC, 2005.
  • [8] M. Meimaris, G. Papastefanatos, N. Mamoulis, and I. Anagnostopoulos. Extended characteristic sets: Graph indexing for SPARQL query optimization. In ICDE, 2017.
  • [9] G. Montoya, H. Skaf-Molli, and K. Hose. The Odyssey approach for optimizing federated SPARQL queries. In ISWC, 2017.
  • [10] T. Neumann and G. Moerkotte. Characteristic sets: Accurate cardinality estimation for RDF queries with multiple joins. In ICDE, 2011.
  • [11] T. Neumann and G. Weikum. The RDF-3X engine for scalable management of RDF data. The VLDB Journal, 19(1):91–113, 2010.
  • [12] N. Papailiou, D. Tsoumakos, I. Konstantinou, P. Karras, and N. Koziris. H2RDF+: An efficient data management system for big RDF graphs. In ACM SIGMOD, 2014.
  • [13] M. Pham and P. Boncz. Exploiting emergent schemas to make RDF systems more efficient. In ISWC, 2016.
  • [14] M. Pham, L. Passing, O. Erling, and P. Boncz. Deriving an emergent relational schema from RDF data. In WWW, 2015.
  • [15] A. Schätzle, M. Przyjaciel-Zablocki, A. Neu, and G. Lausen. Sempala: Interactive SPARQL query processing on Hadoop. In ISWC, 2014.
  • [16] A. Schätzle, M. Przyjaciel-Zablocki, S. Skilevic, and G. Lausen. S2RDF: RDF querying with SPARQL on Spark. In VLDB, 2016.
  • [17] L. Sidirourgos, R. Goncalves, M. Kersten, N. Nes, and S. Manegold. Column-store support for RDF data management: Not all swans are white. In VLDB, 2008.
  • [18] C. Weiss, P. Karras, and A. Bernstein. Hexastore: Sextuple indexing for semantic web data management. In VLDB, 2008.
  • [19] K. Wilkinson. Jena property table implementation, 2006.
  • [20] P. Yuan, P. Liu, B. Wu, H. Jin, W. Zhang, and L. Liu. TripleBit: A fast and compact system for large scale RDF data. In VLDB, 2013.