1 Introduction
Recent work in RDF data management has shown that extracting and exploiting the implicit schema of the data can benefit both storage and SPARQL query performance [14, 13, 8, 9]. In order to organize triples on disk, index them and query them efficiently, these approaches rely heavily on two structural components of an RDF dataset, namely (i) characteristic sets (CSs), i.e., the distinct property sets that characterize subject nodes, and (ii) the join links between CSs. For the latter, in our previous work we introduced Extended Characteristic Sets (ECSs) [8], typed links between CSs that exist only when there are object-subject joins between their triples, and we showed how RDF data management can rely extensively on CSs and ECSs for both storage and indexing, yielding significant performance benefits in heavy SPARQL workloads. However, this approach failed to address schema heterogeneity in loosely-structured datasets, which implies a large number of CSs and ECSs (e.g., Geonames contains 851 CSs and 12136 CS links), and thus skewed data distributions that impose large overheads on extraction, storage and disk-based retrieval [14, 8].
In this paper, we exploit the hierarchical relationships between CSs, as captured by the subsumption of their respective property sets, in order to merge related CSs. We follow a relational implementation approach, storing all triples corresponding to a set of merged CSs in a separate relational table and executing queries through a SPARQL-to-SQL translation. Although alternative storage technologies could be considered (key-value stores, graph stores, etc.), we have selected well-established relational database systems for the implementation of our approach, in order to take advantage of existing data indexing and query processing techniques that have been proven to scale to large and complex datasets. To this end, we present a novel system, named raxonDB, which exploits these hierarchies in order to merge hierarchically related CSs and decrease the number of CSs and the links between them, resulting in a more compact schema with better data distribution. The resulting system, built on top of PostgreSQL, provides significant improvements in both the storage and the query performance of RDF data.
In short, our contributions are as follows:

We introduce a novel CS merging algorithm that takes advantage of CS hierarchies,

we implement raxonDB, an RDF engine built on top of a relational backbone that takes advantage of this merging for both storing and query processing,

we perform an experimental evaluation that indicates significant performance improvements for various parameter configurations.
2 Related Work
RDF data management systems generally follow three storage schemes, namely triples tables, property tables, and vertical partitioning. A triples table has three columns, representing the subject, predicate and object (SPO) of an RDF triple. This technique replicates the data in different orderings in order to facilitate sort-merge joins. RDF-3X [11] and Hexastore [18] build tables on all six permutations of SPO. Built on a relational backbone, Virtuoso [4] uses a 4-column table for quads, and a combination of full and partial indexes. These methods work well for queries with small numbers of joins, but they degrade as dataset size, the number of unbound variables and the number of joins increase.
The property table approach places data in tables whose columns correspond to properties of the dataset, where each table identifies a specific resource type. Each row identifies a subject node and holds the value of each property. This technique has been implemented experimentally in Jena [19] and DB2RDF [3], and shows promising results when resource types and their properties are well-defined. However, it incurs extra space overhead for NULL values in cases of sparse properties [1]. It also raises performance issues when handling complex queries with many joins, as the amount of intermediate results increases [7].
Vertical partitioning segments the data into two-column tables. Each table corresponds to a property, and each row to a subject node [1]. This provides great performance for queries with bound objects, but suffers when tables vary greatly in size [17]. TripleBit [20] broadly falls under vertical partitioning; in TripleBit, the data is vertically partitioned in chunks per predicate. While this reduces replication, it suffers from the same problems as property tables, and it does not consider the inherent schema of the triples in order to speed up the evaluation of complex query patterns.
In distributed settings, a growing body of literature exists, with systems such as Sempala [15], H2RDF [12] and S2RDF [16]. However, these are based on parallelization of centralized indexing and query evaluation schemes.
For these reasons, the latest state-of-the-art approaches rely on implicit schema detection in order to derive a hidden schema from RDF data and to index/store triples based on this schema. Furthermore, due to the tabular structure that tends to implicitly underlie RDF data, recent works have been implemented on relational backbones. In our previous work [8], we defined Extended Characteristic Sets (ECSs) as typed links between CSs, and we showed how ECSs can be used to index triples and greatly improve query performance. In [14], the authors identify and merge CSs, similarly to our approach, into what they call an emergent schema. However, their main focus is to extract a human-readable schema with appropriate relation labelling. They do not use hierarchical information of CSs; rather, they use semantics to drive the merging process. In [13] it is shown how this emergent schema approach can assist query performance; however, the approach is limited by the constraints of human-readable schema discovery. In our work, query performance, indexing and storage optimization are the main aims of the merging process, and thus we are not concerned with providing human-readable schema information or any form of schema exploration. In [9], the authors use CSs and ECSs in order to assist cost estimation for federated queries, while in [5], the authors use CSs in order to provide better triple reordering plans. To the best of our knowledge, this is the first work to exploit hierarchical CS relations in order to merge CSs and improve query performance.
3 Hierarchical CS Merging
3.1 Preliminaries
The RDF model does not generally enforce structural rules in the representation of triples; within the same dataset there can be largely diverse sets of predicates emitted from nodes of the same semantic type [8, 14, 10]. Characteristic Sets (CSs) [10] capture this diversity by representing implied node types based on the set of properties they emit. Formally, given a collection of triples $D$ and a subject node $s$, the characteristic set of $s$ is $S_C(s) = \{p \mid \exists o : (s, p, o) \in D\}$.
The set of properties of a CS $c$ is denoted by $P_c$. Furthermore, in a given dataset, each CS represents a set of records, each identified by a subject node and holding all of the values of the subject node (i.e., objects) for the predicates in $P_c$. We denote the set of all records of $c$ as $R_c$, while $c$ is represented by a relational table defined by these two elements, i.e., $c = (P_c, R_c)$. The tuples in $R_c$ are of the form $(s, o_1, \ldots, o_n)$, where $s$ is the identifier (e.g., URI) of a subject node and $o_1, \ldots, o_n$ are the values, i.e., object nodes, of the properties in $P_c$ for $s$. In the context of this paper, for the sake of simplicity, the term Characteristic Set will refer collectively to the properties and records of a CS, i.e., its relational table, rather than just the set of properties of the original definition.
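The extraction of CSs and their records from a set of triples can be sketched as follows. This is a minimal illustration, not the paper's implementation; for simplicity it treats every property as single-valued.

```python
from collections import defaultdict

def characteristic_sets(triples):
    """Group triples by subject and collect each subject's property set.

    `triples` is an iterable of (subject, predicate, object) tuples.
    Returns a dict mapping each distinct property set P_c (a frozenset,
    i.e. a CS) to the rows of its relational table R_c: one row per
    subject, starting with the subject id, followed by the object values
    of the properties in sorted order.
    """
    props_of = defaultdict(set)    # subject -> set of predicates it emits
    values_of = defaultdict(dict)  # subject -> {predicate: object}
    for s, p, o in triples:
        props_of[s].add(p)
        values_of[s][p] = o        # single-valued simplification
    tables = defaultdict(list)
    for s, props in props_of.items():
        cs = frozenset(props)
        row = (s,) + tuple(values_of[s][p] for p in sorted(cs))
        tables[cs].append(row)
    return tables

triples = [
    ("s1", "name", "Alice"), ("s1", "age", "30"),
    ("s2", "name", "Bob"), ("s2", "age", "25"), ("s2", "email", "b@x.org"),
]
tables = characteristic_sets(triples)
# two CSs: {name, age} for s1 and {name, age, email} for s2
```

A single scan of the dataset suffices, which matches the intuition that CS retrieval is cheap relative to the merging step discussed later.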
Within a given dataset, CSs often exhibit hierarchical relationships as a result of the overlaps in their comprising sets of properties. For example, consider two CSs $c_1$ and $c_2$ describing human beings, with property sets $P_{c_1}$ and $P_{c_2}$ such that $P_{c_1} \subset P_{c_2}$; then $c_1$ is a parent of $c_2$. This relationship entails an overlap of the properties that define the CSs, and can be exploited in order to provide a means to merge related CSs based on the specialization or generalization of the node types they describe. In what follows, we formally define the notions of CS subsumption, hierarchy and ancestral subgraphs.
Definition 1. (CS Subsumption). Given two CSs $c_i$ and $c_j$ with property sets $P_{c_i}$ and $P_{c_j}$, $c_i$ subsumes $c_j$, denoted $c_i \succ c_j$, when $P_{c_i}$ is a proper subset of $P_{c_j}$, i.e., $P_{c_i} \subset P_{c_j}$. Subsumption forms parent-child relationships between CSs, which can be seen in Figure 1(a) as directed edges between nodes. The set of all parent-child relationships defines a CS hierarchy, as defined in the following.
Definition 2. (CS Hierarchy and Inferred Hierarchy). CS subsumption creates a partial ordering that essentially defines a hierarchy such that when $c_i \succ c_j$, then $c_i$ is a parent of $c_j$. Formally, a CS hierarchy is a graph $H = (V, E)$, where $V$ is the set of CSs and $E$ a set of directed edges between them. A directed edge $(c_i, c_j)$ exists in $E$ when $c_i \succ c_j$ and there exists no other $c_k$ such that $c_i \succ c_k$ and $c_k \succ c_j$. An example CS hierarchy can be seen in Figure 1(a). Given a hierarchy $H$, we denote the hierarchical closure of $H$ with $H^+ = (V, E^+)$, where $E^+$ extends $E$ to contain the inferred edges between hierarchically related nodes that are not consecutive, e.g., a node and its grandchildren. An example inferred hierarchy can be seen in Figure 1(c) for a subgraph of the graph in Figure 1(a), with the inferred relationships shown as dashed lines. In the remainder of this paper, we refer to $H^+$ as the inferred hierarchy of $H$.
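The distinction between the hierarchy $H$ and its closure $H^+$ can be made concrete with a small sketch (illustrative only; property sets are modeled as frozensets, so the `<` operator tests proper subset):

```python
def inferred_hierarchy(property_sets):
    """All subsumption pairs (the closure H+): (a, b) whenever a's
    property set is a proper subset of b's, i.e. a is an ancestor of b."""
    return {(a, b) for a in property_sets for b in property_sets if a < b}

def direct_edges(closure):
    """Transitive reduction of the closure: keep (a, b) only when no node
    c sits strictly between them, yielding the hierarchy H of Def. 2."""
    nodes = {n for edge in closure for n in edge}
    return {(a, b) for (a, b) in closure
            if not any((a, c) in closure and (c, b) in closure
                       for c in nodes)}

A = frozenset({"p1"})
B = frozenset({"p1", "p2"})
C = frozenset({"p1", "p2", "p3"})
closure = inferred_hierarchy([A, B, C])  # contains (A, C) as an inferred edge
direct = direct_edges(closure)           # only the consecutive edges remain
```

Here the edge from A to C is present only in the closure, mirroring the dashed (inferred) edges of Figure 1(c).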
Definition 3. (CS Ancestral Subgraphs). Given an inferred hierarchy $H^+$, a CS $c_b$ and a set of CSs $A$, the subgraph of $H^+$ induced by $A \cup \{c_b\}$ is an ancestral subgraph with $c_b$ as the lowermost child when, $\forall c_i \in A$, it holds that $c_i \succ c_b$. This means that any subgraph of $H^+$ with $c_b$ as its only sink node is an ancestral subgraph of $c_b$. Thus, it holds that $\forall c_i \in A$, $P_{c_i} \subset P_{c_b}$. For instance, in Figure 1(c), a set of ancestor nodes together with their common lowermost descendant forms an ancestral subgraph, with that descendant as the base CS; several such subgraphs with different base CSs can be formed from the same inferred hierarchy.
Logically, we map each CS to a relational table: for a CS $c$ we create a table $T_c(id, p_1, \ldots, p_n)$, where $id$ is the id of the subject and $p_1, \ldots, p_n$ are the properties in $P_c$. We then use the CS hierarchy in order to merge the nodes of an ancestral subgraph with base $c_b$ into a single table. Specifically, we exploit the property set overlap in order to merge smaller parent CSs into larger child CSs, so as to minimize the effect of the NULL values that appear in parent records for the properties of the child that the parents lack. Thus, $c_b$ will be the most specialized CS in its ancestral subgraph. For this reason, we define a merge operator, $\mu$, as follows.
Definition 4. (Hierarchical CS Merge). Given an ancestral subgraph $G_A = (V_A, E_A)$ with base $c_b$ as defined above, a hierarchical merge of $G_A$ is given as $\mu(G_A) = (P_{c_b}, R_m)$, where $R_m = \bigcup_{c_i \in V_A} \pi_{P_{c_b}}(R_{c_i})$. Here, $P_{c_b}$ is the most specialized property set in $G_A$, as $c_b$ does not have any children in $G_A$, while $R_m$ is the union of the records of all CSs in $V_A$, where $\pi_{P_{c_b}}(R_{c_i})$ is the projection of $R_{c_i}$ on $P_{c_b}$. This means that $R_m$ will contain NULL values for all the non-shared properties of $c_b$ and each $c_i$, i.e., $P_{c_b} \setminus P_{c_i}$. In essence, $\mu$ is an edge contraction operator that merges all nodes of an ancestral subgraph into one, while removing the edges that connect them. For instance, assume that $V_A = \{c_1, c_2, c_3\}$ is the set of vertices of an ancestral subgraph with three CSs, with $P_{c_1} \subset P_{c_2} \subset P_{c_3}$ and $c_3$ as the base. Thus, $\mu(G_A) = (P_{c_3}, \pi_{P_{c_3}}(R_{c_1}) \cup \pi_{P_{c_3}}(R_{c_2}) \cup R_{c_3})$. Hierarchical merging can be seen in Figure 2.
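The $\mu$ operator can be sketched as follows. This is an illustrative implementation under simplifying assumptions (single-valued properties; `tables` maps each property set, a frozenset, to its rows, each row starting with the subject id); properties a parent lacks are padded with `None`, playing the role of SQL NULL.

```python
def merge_tables(base_props, tables):
    """Sketch of the mu operator: union the records of every CS in an
    ancestral subgraph, projected onto the base CS's property set P_cb."""
    cols = sorted(base_props)
    merged = []
    for props, rows in tables.items():
        src = sorted(props)  # column order of the source CS table
        for row in rows:
            values = dict(zip(src, row[1:]))  # property -> object value
            # non-shared properties of the base become None (NULL)
            merged.append((row[0],) + tuple(values.get(c) for c in cols))
    return ["id"] + cols, merged
```

Records of the base CS pass through unchanged, while parent records gain NULL cells, which is exactly the overhead the cost model below tries to bound.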
Definition 5. (Merge Graph). Given an inferred CS hierarchy $H^+ = (V, E^+)$, a merge graph $M = (V_M, E_M)$ is a graph that consists of a set of ancestral subgraphs and has the following properties: (i) $M$ contains all nodes in $V$, i.e., $V_M = V$, so that it covers all CSs in the input dataset, (ii) $E_M$ is a subset of the edges in $E^+$, i.e., $E_M \subseteq E^+$, (iii) each node $c \in V_M$ is contained in exactly one ancestral subgraph, and (iv) all ancestral subgraphs are pairwise disconnected, i.e., there exist no edges between the nodes of different ancestral subgraphs. Thus, each ancestral subgraph can be contracted into one node unambiguously using the $\mu$ operator. Also, the total number of relational tables will be equal to the number of ancestral subgraphs in the merge graph.
Problem Formulation. Given an inferred CS hierarchy $H^+$, the problem is to find a merge graph $M$, in the form of a set of disconnected ancestral subgraphs, that provides an optimal way to merge CS nodes. In other words, the problem is to find the set of ancestral subgraphs of $H^+$ that minimizes an objective cost function $cost(M)$, or more formally:

$M_{opt} = \operatorname*{argmin}_{M} \; cost(M)$   (1)

This formulation entails several problems. First, the notion of cost depends on possibly subjective factors, such as the query workload, the storage technology, the input dataset and so on; there is no universal cost model that can be deployed in order to assess the effectiveness of a merge graph. Moreover, neither the number of ancestral subgraphs nor the set of subgraph roots is known as part of the input. A CS hierarchy of $n$ nodes can potentially create an exponential number of subgraphs, and the number of possible sets of subgraph roots is also exponential with respect to the hierarchy size. Thus, given an arbitrary cost function, this is a problem of non-uniform graph partitioning on the inferred hierarchy, which is known to be NP-hard. That is, even with a deployed cost model, it is still exponentially expensive to enumerate all possible subgraphs and find the one with the minimum cost. For these reasons, we approach the problem by deploying a set of rules and heuristics that find a good merge graph efficiently and offer improved storage and query performance, as will be shown in the experiments.
3.2 CS Retrieval and Merging
The primary focus of this work is to improve the efficiency of the storage and query capabilities of relational RDF engines by exploiting the implicit schema of the data in the form of CSs. However, CS merging results in several problems that need to be addressed in this context. These are discussed in what follows.
First, the problem of selecting ancestral subgraphs is computationally hard, as mentioned earlier. For this reason, we rely on a simple heuristic in order to seed the process and provide an initial set of ancestral subgraph sink nodes that will form the bases of the final merged tables, as defined in Definition 3. For this, we identify dense CS nodes in the hierarchy (i.e., nodes with large cardinalities) and use these nodes as the bases of the ancestral subgraphs. While node density can be defined in many different ways, in the context of this work we define a CS $c$ to be dense if its cardinality is larger than a linear function of the maximum CS cardinality in $H$, i.e., if $|R_c| > m \times \max_{c' \in V} |R_{c'}|$, with $0 \le m \le 1$. Here, $m$ is called the density factor. This means that, by definition, if $m = 0$, no CSs will be merged, because all CSs will be considered dense and thus each CS will define its own ancestral subgraph, while if $m = 1$, no ancestral subgraphs will be defined and all CSs will be merged into one large table, as no CS has a cardinality strictly larger than that of the largest CS. With a given $m$, the problem is reduced to finding the optimal ancestral subgraph for each dense node.
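The density threshold can be captured in a few lines (a sketch; `cardinality` maps each CS to $|R_c|$ and is an assumed representation, not the paper's data structure):

```python
def dense_nodes(cardinality, m):
    """Return the CSs whose cardinality strictly exceeds m times the
    maximum CS cardinality, for a density factor m in [0, 1]."""
    threshold = m * max(cardinality.values())
    return {c for c, n in cardinality.items() if n > threshold}
```

Note the boundary behavior described above: with `m = 0` every CS is dense (each seeds its own subgraph), while with `m = 1` no CS is dense, since no cardinality is strictly larger than the maximum.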
Second, merging tables results in the introduction of NULL values for the non-shared columns, which can degrade performance. Specifically, merging CSs with different property sets can result in large numbers of NULL values in the resulting table. Given a parent CS $c_p$ and a child CS $c_b$ with many non-shared properties and a large $|R_{c_p}|$, the resulting NULL cells will be significantly numerous compared to the total number of records, thus potentially causing poor storage and querying performance [14]. For this reason, CS merging must be performed in a way that minimizes the presence of NULL values. The following function captures the NULL-value effect of merging two CSs with $c_p \succ c_b$:

$nullEffect(c_p, c_b) = \dfrac{|P_{c_b} \setminus P_{c_p}| \times |R_{c_p}|}{|R_{c_b}|}$   (2)
Intuitively, $nullEffect$ represents the ratio of introduced NULL values to the cardinality of the base CS in the merge. The numerator of the fraction represents the total number of cells that will be NULL, as the product of the number of non-shared properties and the cardinality of the parent CS. The denominator is the cardinality of the base CS. Hence, the base CS must be a descendant (i.e., a CS with more properties) in order to minimize the presence of NULLs.
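Equation (2) translates directly into code (a sketch; property sets are frozensets and cardinalities plain integers):

```python
def null_effect(parent_props, parent_card, base_props, base_card):
    """Eq. (2): NULL cells introduced by merging parent c_p into base c_b,
    relative to the base cardinality:
    |P_cb \ P_cp| * |R_cp| / |R_cb|."""
    return len(base_props - parent_props) * parent_card / base_card
```

For example, merging a parent with 100 records lacking two of the base's properties into a base with 1000 records yields a ratio of 0.2, i.e., 200 NULL cells spread over 1000 base records.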
In order to assess an ancestral subgraph, we use a generalized version of $nullEffect$ that captures the NULL-value effect on the whole subgraph:

$nullEffect(G_A) = \sum_{c_i \in V_A \setminus \{c_d\}} \dfrac{|P_{c_d} \setminus P_{c_i}| \times |R_{c_i}|}{|R_{c_d}|}$   (3)
Here, $c_d$ is the dense root of subgraph $G_A$. However, merging a parent into a dense child changes the structure of the input graph, as the cardinality of the dense node is increased. To accommodate this, we define a cost function that works on the graph level, as follows:

$cost(M) = \sum_{i=1}^{k} nullEffect(G_{A_i})$   (4)

where $k$ is the number of dense nodes, $c_{d_i}$ is a dense node and $G_{A_i}$ is the ancestral subgraph with $c_{d_i}$ as its base node.
Given this cost model and a predefined set of dense nodes, our exact algorithm finds the optimal subgraph for each dense node. An inferred hierarchy graph can be converted to a set of connected components derived by removing the outgoing edges from dense nodes, since we are not interested in merging children into parents, but only parents into children. An example of this can be seen in Figure 1(b). The components are independent, so $cost(M)$ can be computed as the sum of the costs of the components. The main idea is to identify all connected components in the CS graph, iterate through these components, enumerate all subgraphs within the components that start from the given set of dense nodes, and select the optimal partitioning for each component.
The algorithm can be seen in Algorithm 1. The algorithm works by first identifying all connected components of the inferred hierarchy (Line 2). Identifying connected components is done trivially using standard DFS traversal, and is not shown in the algorithm. Then, we iterate over the components (Line 3) and, for each component, we generate all possible subgraphs. We calculate the cost of each subgraph (Line 7) and, if it is smaller than the current minimum, the minimum cost and best subgraph are updated (Lines 8-9). Finally, we add the best subgraph to the final list (Line 11) and move to the next component.
To generate the subgraphs, we do not need an exhaustive generation of combinations; instead, we rely on the observation that each non-dense node must be merged into exactly one dense node. Therefore, subgraph generation is reduced to finding all possible assignments of dense nodes to the non-dense nodes. An example of this can be seen in Figure 1, where all possible and meaningful subgraphs are enumerated in the table at the right of the figure by assigning a dense node to each of the non-dense nodes. An assignment is only possible if there exists a parent-child relationship between the non-dense node and the dense node, even an inferred one. Hence, the problem of subgraph generation becomes one of generating combinations from different lists by selecting one element from each list: the number of lists is equal to the number of non-dense nodes, and the elements of each list are the dense nodes related to that non-dense node.
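This "one element from each list" enumeration is exactly a Cartesian product. A minimal sketch (node names and the `candidates` mapping are illustrative):

```python
from itertools import product

def enumerate_assignments(candidates):
    """`candidates` maps each non-dense node to the list of dense nodes it
    is an (inferred) parent of.  Each yielded assignment picks one dense
    node per non-dense node, i.e. one candidate merge graph."""
    nodes = list(candidates)
    for choice in product(*(candidates[n] for n in nodes)):
        yield dict(zip(nodes, choice))
```

The number of yielded assignments is the product of the list lengths, which is the source of the exponential blow-up analyzed below.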
Complexity Analysis. Assuming that a connected component has $n$ non-dense nodes and $k$ dense nodes, and each non-dense node is related to $d \le k$ dense nodes, then the number of subgraphs that need to be enumerated is $O(d^n)$. In the worst case, all $k$ dense nodes are parents of all $n$ non-dense nodes; then the number of total subgraphs is $k^n$, which makes the asymptotic time complexity of the algorithm $O(k^n)$.
3.3 Greedy Approximation
For very small components, the asymptotic complexity of the exact algorithm is acceptable. However, in real-world cases, the number of connected components can be small, making $n$ large. For this reason, we introduce a heuristic algorithm that approximates the solution without enumerating all possible combinations; instead, it relies on a greedy objective function that attempts to find, for each non-dense node, the local minimum with respect to our defined cost model. Note that it lies beyond the scope of this work to compute the degree of approximation to the optimal solution; however, in our experiments, the heuristic solution is shown to provide significant performance gains.
The main idea behind the algorithm is to iterate over the non-dense nodes and, for each one, find the dense node that minimizes the $nullEffect$ function for it. The cardinalities are then recomputed and the next non-dense node is examined. The algorithm can be seen in Algorithm 2. In the beginning, the algorithm initializes a hash table with an empty list for each dense node (Lines 1-4). Then, the algorithm iterates over all non-dense nodes (Line 5) and, for each one, it calculates the cost of merging it into each of its connected dense nodes (Lines 5-13), keeping the current minimum cost and dense node. In the end, the current non-dense node is added to the list of the dense node that minimizes the cost (Line 14). Notice that we do not need to split the hierarchy into connected components for the greedy algorithm to work.
Complexity Analysis. Given $n$ non-dense nodes and $k$ dense nodes, where each non-dense node is related to $d \le k$ dense nodes, the algorithm needs $n \times d$ iterations, because we need to examine $d$ dense nodes for each non-dense node. In the worst case, every non-dense node is related to all dense nodes, requiring $n \times k$ iterations. Assuming a constant cost for the computation of $nullEffect$, the asymptotic complexity of the greedy algorithm is $O(nk)$, a significant improvement compared to the exponential complexity of the exact algorithm.
Obviously, this process does not necessarily cover all CSs of the input dataset. The percentage of the dataset that is covered by this process is called dense CS coverage. The remainder of the CSs that are not contained by any merge path are aggregated into one large table containing all of their predicates. If the total coverage of the merging process is large, then this large table does not impose a heavy overhead in query performance, as will be shown in the experiments. Finally, we load the data in the corresponding tables.
3.4 Implementation
We implemented raxonDB as a storage and querying engine that supports hierarchical CS merging, and can be deployed on top of standard RDBMS solutions. Specifically, we used PostgreSQL 9.6, but raxonDB can be adapted for other relational databases as well. The architecture of raxonDB can be seen in Figure 4.
CS Retrieval and Merging. The processes of retrieving and merging CSs take place during the loading stage of an incoming RDF dataset. CS retrieval is a trivial procedure that requires scanning the whole dataset and storing the unique sets of properties that are emitted from the subject nodes in the incoming triples, and is adopted from our previous work in [8], where it is described in detail. After retrieving the CSs, the main idea is to compute the inferred CS hierarchy and apply one of the described merging algorithms. Finally, each set of merged CSs is stored in a relational table. In each table, the first column represents the subject identifier, while the rest of the columns represent the union of the property sets of the merged CSs. For multi-valued properties, we use PostgreSQL's array data type in order to avoid duplication of rows.
Indexing. We deploy several indexes in raxonDB. First off, we index the subject id of each row. We also build foreign-key indexes on object-subject links between rows in different CSs, i.e., when a value of a property in one CS is the subject id of another CS. Next, we use standard B+-trees for indexing single-valued property columns, while we use PostgreSQL's GIN indexes, which apply to array datatypes, for indexing multi-valued properties. This enables fast access for CS chain queries, i.e., queries that apply successive joins over object-subject relationships. Furthermore, we store these links on the schema level as well, i.e., we keep an index of CS pairs that are linked by at least one object-subject pair of records. These links are called Extended Characteristic Sets (ECSs) and are based on our previous work in [8]. With the ECS index, we can quickly filter out CSs that are guaranteed not to be related, i.e., between which no joins exist, even if they are individually matched in a chain of query CSs. Other metadata and indexes include the property sets of CSs, and which properties can contain multiple values in the same CS.
Query Processing. Processing SPARQL queries on top of merged CSs entails (i) parsing the queries, (ii) retrieving the query CSs, (iii) identifying the joins between them, and (iv) mapping them to merged tables in the database. Steps (i)-(iii) are inherited from our previous work in [8]. For (iv), a query CS can match more than one table in the database. For instance, consider a query containing a chain of three CSs, $c_1$, $c_2$, $c_3$, joined sequentially with object-subject joins. Each query CS $c_i$ matches all tables whose property sets are supersets of the property set of $c_i$. Thus, each join in the initial query creates a set of permutations of table joins that need to be evaluated. For instance, assume that $c_1$ matches table $t_1$, $c_2$ matches tables $t_2$ and $t_3$, and $c_3$ matches tables $t_4$ and $t_5$. Furthermore, by looking up the ECS index, we derive that the links $t_1$-$t_2$, $t_1$-$t_3$, $t_2$-$t_4$ and $t_3$-$t_5$ are all valid, i.e., they correspond to candidate joins in the data. Then, $t_1$-$t_2$-$t_4$ and $t_1$-$t_3$-$t_5$ are valid table permutations that must be processed. Two strategies can be employed here: the first is to join the UNIONs of the matching tables for each $c_i$, and the other is to process each permutation of tables separately and append the results. Given the filtering performed by the ECS indexing approach, where we can pre-filter CSs based on the relationships between them, the UNION would impose significant overhead and eliminate the advantage of ECS indexing. Therefore, we have implemented the second approach, that is, we process a separate query for each permutation. Finally, due to the existence of NULL values in the merged tables, we must add explicit IS NOT NULL restrictions for all the properties that are contained in each matched CS and are not part of any other restriction or filter in the original query.
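The per-permutation query generation described above can be sketched as follows. All names are illustrative assumptions (`s_id` as the subject-id column, simple table names, no ECS pre-filtering and no alias handling for a table matching two query CSs), not raxonDB's actual SQL:

```python
from itertools import product

def permutation_queries(matches, joins, not_null):
    """Build one SQL string per table permutation.

    matches:  query CS -> list of candidate tables (supersets of its P_c)
    joins:    (cs_a, property, cs_b) object-subject joins of the query
    not_null: query CS -> properties needing explicit IS NOT NULL guards
    """
    css = list(matches)
    for tabs in product(*(matches[c] for c in css)):
        table_of = dict(zip(css, tabs))
        conds = [f"{table_of[a]}.{p} = {table_of[b]}.s_id"
                 for a, p, b in joins]
        conds += [f"{table_of[c]}.{p} IS NOT NULL"
                  for c in not_null for p in not_null[c]]
        yield f"SELECT * FROM {', '.join(tabs)} WHERE {' AND '.join(conds)}"
```

In the full system, permutations whose links are absent from the ECS index would be skipped before any SQL is issued, and the results of the surviving queries are appended.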
4 Experimental Evaluation
We implemented raxonDB on top of PostgreSQL; the code and queries are available at https://github.com/mmeimaris/raxonDB. We did not extend our previous native RDF implementation of axonDB [8] because, given the underlying relational schema of the CS tables, we decided to rely on a well-established relational engine for both the planning and the execution of queries, instead of re-implementing them. As the focus of this paper is to improve RDF storage and querying efficiency in relational settings, we rely on the existing mechanisms within PostgreSQL for I/O operations, physical storage and query planning. In this set of experiments, we report results obtained with the greedy approximation algorithm, as the exact algorithm failed to finish the merging process even on datasets with small numbers of CSs.
Datasets. For this set of experiments, we used two synthetic datasets, namely LUBM2000 (300m triples) and WatDiv (100m triples), as well as two real-world datasets, namely Geonames (170m triples) and Reactome (15m triples). LUBM [6] is a customizable generator of synthetic data that describes academic information about universities, departments, faculty, and so on. Similarly, WatDiv [2] is a customizable generator with more options for the production and distribution of triples to classes. Reactome (http://www.ebi.ac.uk/rdf/services/reactome) is a biological dataset that describes biological pathways, and Geonames (http://www.geonames.org/ontology/documentation.html) is a widely used ontology of geographical entities with varying properties. Geonames maintains a rich graph structure, as there is heavy usage of hierarchical area features on a multitude of levels.
Loading. In order to assess the effect of hierarchical merging on the loading phase, we performed a series of experiments using all four datasets. For this experiment, we measure the size on disk, the loading time, the final number of merged tables, the number of ECSs (joins between merged tables) and the percentage of triples covered by the CSs included in the merging process, for varying values of the density factor $m$. The results are summarized in Table 1. As can be seen, the number of CSs, and consequently of tables, is greatly reduced with increasing values of $m$. As the number of CSs is reduced, the expected number of joins between CSs is also reduced, as can be seen in the ECS column. Consequently, the number of tables can be decreased significantly without trading off large amounts of dense CS coverage, i.e., without creating large tables with many NULL values. Loading time tends to grow slightly as the number of CSs decreases, and thus the number of merges increases, the only exception being WatDiv, where loading time actually decreases. This is a side-effect of the excessive number of tables (5667) in the simple case, which imposes large overheads for the persistence of the tables on disk and the generation of indexes and statistics for each one.
Dataset             | Size (MB) | Time  | # Tables (CSs) | # of ECSs | Dense CS Coverage
--------------------|-----------|-------|----------------|-----------|------------------
Reactome Simple     | 781       | 3min  | 112            | 346       | 100%
Reactome (m=0.05)   | 675       | 4min  | 35             | 252       | 97%
Reactome (m=0.25)   | 865       | 4min  | 14             | 73        | 77%
Geonames Simple     | 4991      | 69min | 851            | 12136     | 100%
Geonames (m=0.0025) | 4999      | 70min | 82             | 2455      | 97%
Geonames (m=0.05)   | 5093      | 91min | 19             | 76        | 87%
Geonames (m=0.1)    | 5104      | 92min | 6              | 28        | 83%
LUBM Simple         | 591       | 3min  | 14             | 68        | 100%
LUBM (m=0.25)       | 610       | 3min  | 6              | 21        | 90%
LUBM (m=0.5)        | 620       | 3min  | 3              | 6         | 58%
WatDiv Simple       | 4910      | 97min | 5667           | 802       | 100%
WatDiv (m=0.01)     | 5094      | 75min | 67             | 99        | 77%
WatDiv (m=0.1)      | 5250      | 75min | 25             | 23        | 63%
WatDiv (m=0.5)      | 5250      | 77min | 16             | 19        | 55%
Query Performance. In order to assess the effect of the density factor during query processing, we performed a series of experiments on LUBM, Reactome and Geonames. For the workload, we used the sets of queries from [8]. We employ two metrics, namely execution time and number of table permutations. The results can be seen in Figures 5 and 6. As can be seen, hierarchical CS merging can speed up query performance significantly as long as the dense CS coverage remains high. For example, in all datasets, query performance degrades dramatically when $m = 1$, in which case the merging process cannot find any dense CSs; all rows are then added to a single large table with many NULL cells. These findings are consistent across all three datasets, and motivate further work on identifying the optimal value for $m$.
In order to assess the performance of raxonDB and establish that no overhead is imposed by the relational backbone, we performed a series of queries on LUBM2000, Geonames and Reactome, employing the best-performing CS merging as captured by the value of $m$ in our previous findings. We also compared query performance with RDF-3X, Virtuoso 7.1, TripleBit and the emergent schema approach described in [13]. The results can be seen in Figure 7 and indicate that raxonDB provides equal or better performance than the original axonDB implementation, as well as the rest of the systems, including the emergent schema approach, which is the only direct competitor in merging CSs. Especially for queries with large intermediate results and low selectivity that correspond to a few CSs and ECSs (e.g., LUBM Q5 and Q6, Geonames Q5 and Q6), several of the other approaches are slow to answer and in some cases time out.
5 Conclusions and Future Work
In this paper, we tackled the problem of merging characteristic sets based on their hierarchical relationships. As future work, we will study the computation of the optimal value for the density factor $m$, taking into consideration workload characteristics as well as a more refined cost model for the ancestral subgraphs. Furthermore, we will study the application of these findings in a distributed architecture, in order to further scale the capabilities of raxonDB.
References
 [1] D. J. Abadi, A. Marcus, S. R. Madden, and K. Hollenbach. Scalable semantic web data management using vertical partitioning. In VLDB, 2007.
 [2] G. Aluç, O. Hartig, M. T. Özsu, and K. Daudjee. Diversified stress testing of rdf data management systems. In ISWC, 2014.
 [3] M. A. Bornea, J. Dolby, A. Kementsietsidis, K. Srinivas, P. Dantressangle, O. Udrea, and B. Bhattacharjee. Building an efficient rdf store over a relational database. In ACM SIGMOD, 2013.
 [4] O. Erling and I. Mikhailov. Virtuoso: RDF support in a native RDBMS. Springer, 2010.
 [5] A. Gubichev and T. Neumann. Exploiting the query structure for efficient join ordering in sparql queries. In EDBT, 2014.
 [6] Y. Guo, Z. Pan, and J. Heflin. Lubm: A benchmark for owl knowledge base systems. Web Semantics: Science, Services and Agents on the World Wide Web, 3(2):158–182, 2005.
 [7] M. Janik and K. Kochut. Brahms: a workbench rdf store and high performance memory system for semantic association discovery. In ISWC, 2005.
 [8] M. Meimaris, G. Papastefanatos, N. Mamoulis, and I. Anagnostopoulos. Extended characteristic sets: Graph indexing for sparql query optimization. In ICDE, 2017.
 [9] G. Montoya, H. SkafMolli, and K. Hose. The odyssey approach for optimizing federated sparql queries. In ISWC, 2017.
 [10] T. Neumann and G. Moerkotte. Characteristic sets: Accurate cardinality estimation for RDF queries with multiple joins. In ICDE, 2011.
 [11] T. Neumann and G. Weikum. The RDF3x engine for scalable management of RDF data. The VLDB Journal, 19(1):91–113, 2010.
 [12] N. Papailiou, D. Tsoumakos, I. Konstantinou, P. Karras, and N. Koziris. H2RDF+: an efficient data management system for big RDF graphs. In ACM SIGMOD, 2014.
 [13] M. Pham and P. Boncz. Exploiting emergent schemas to make rdf systems more efficient. In ISWC, 2016.
 [14] M. Pham, L. Passing, O. Erling, and P. Boncz. Deriving an emergent relational schema from rdf data. In WWW, 2015.
 [15] A. Schätzle, M. PrzyjacielZablocki, A. Neu, and G. Lausen. Sempala: interactive sparql query processing on hadoop. In ISWC, 2014.
 [16] A. Schätzle, M. PrzyjacielZablocki, S. Skilevic, and G. Lausen. S2rdf: Rdf querying with sparql on spark. In VLDB, 2016.
 [17] L. Sidirourgos, R. Goncalves, M. Kersten, N. Nes, and S. Manegold. Columnstore support for rdf data management: not all swans are white. In VLDB, 2008.
 [18] C. Weiss, P. Karras, and A. Bernstein. Hexastore: sextuple indexing for semantic web data management. In VLDB, 2008.
 [19] K. Wilkinson. Jena property table implementation, 2006.
 [20] P. Yuan, P. Liu, B. Wu, H. Jin, W. Zhang, and L. Liu. Triplebit: a fast and compact system for large scale rdf data. In VLDB, 2013.