Multi-SQL: An extensible multi-model data query language

11/17/2020 ∙ by Yu Yan, et al. ∙ 0

Big data management aims to establish data hubs that support data in multiple models and types in an all-around way. Thus, the multi-model database system is a promising architecture for building such a multi-model data store. For an integrated data hub, a unified and flexible query language is incredibly necessary. In this paper, an extensible and practical query language–Multi-SQL is proposed to realize the unified management of multi-model data considering the co-processing of multi-model data. To the best of our knowledge, Multi-SQL is the first query language based on various data models. We take four popular data models, i.e., RELATION, GRAPH, KV, DOCUMENT, as the foundation of Multi-SQL design. Multi-SQL can also be expanded to suit more complicated scenarios as it is flexible to add other data models. Moreover, we provide a formal semantic definition of the core features of Multi-SQL and propose an effective algorithm based on equivalence class rule (ECT) to process multi-query. Compared with some existing popular databases' query languages, experimental results demonstrate that Multi-SQL is remarkable in the case of complex multi-queries.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

With the advent of the big data age, different types and formats of data have risen explosively. The integrated management of multiple data has become a trend, and the demand for a unified and flexible query language has become more and more urgent.

Even though existing single-model database management systems are efficient for specific model data[2-7], complex computer application scenarios involve multiple data models, large amounts of data, and high coupling between data, making single-model databases inflexible and inefficient. For example, in the social network, user information such as user IDs is often stored in a relational table, and the relationships between users are stored as a large graph. When obtaining the relationship between users through IDs, with individual databases for single models, the programmer first needs to query the IDs from the relational database, and then obtain the user relationship from the graph database through IDs, which is pretty complex. From this example, individual single-model databases hardly meet the needs of multi-model data scenarios.

Currently, there are two main approaches for processing and managing multi-model data: (1) polyglot persistence [8] and (2) multi-model database [9-10].

The central idea of polyglot persistence is to use different databases and platforms to store data in various models and then develop middlewares to centralize them. For instance, Rheem [11] determines the best platform for performing data tasks, and BigDAWG [12] demonstrates an architecture that supports various database systems with different data models through a unified multi-island interface middleware. In general, the polyglot persistence approach’s complexity and cost are relatively high because the integration of numerous independent databases brings massive engineering and operating costs. Besides, to answer cross-model queries, each database needs to keep running, which reduces the fault tolerance of the entire system.

A multi-model database is an entire database that uses a fully integrated back-end to manage different data models [13]. The development of multi-model databases is very rapid, and many single-model databases have also been transformed to support multi-model data. For example, OrientDB [14] nowadays supports the RDF, key-value, and JSON models. Riak [15] supports the graph, key-value, and XML models. Compared with polyglot persistence, the multi-model database approach can reduce integration, migration, development and maintenance costs on multiple platforms.

Since most existing multi-model databases are transformed from single-model databases, there are some problems with multi-model databases’ current query languages. In terms of the design idea, the existing multi-model query languages, such as SQL-extend [16] and JSON-extend [17], are not based on the multi-model concept. Each addition of a new data model makes the language more logically complicated and leads to low flexibility. In terms of the specific implementation, the existing multi-model query language such as Cassandra Query Language(CQL)[18] and AQL of ArangoDB[19] is based on a single-model database language like SQL[1] to expand, rewrite, or translate. The efficiency is not high enough, and there are some inevitable losses in the process. For example, AQL does not support data definition operations, such as creating and deleting databases, collections, and indexes. Also, in order to support queries on XML, extend-SQL used in PostgreSQL adds operation Xpath. However, when using extend-SQL to perform complex XML queries, users must first decompose the complex query by themselves instead of directly perform complex queries, which reduces efficiency.

To solve the above problems, we propose an extensible multi-model query language—Multi-SQL from a multi-model design perspective. We define an easy-to-learn language standard with high flexibility to realize the unified management of multi-model data. Multi-SQL can efficiently create multi-model data objects, store multi-model data, perform operations like retrieve, modify, and delete efficiently. Our main contributions are as follows:

(1) We design a multi-model query language–Multi-SQL and gives the language standard, which solves inflexibility, high cost, and low-efficiency problems of multi-model data management. (2) We formalize the data model of Multi-SQL with the four mainstream data models, namely relation model, graph model, document model, and key-value model.(3) We define the major grammatical rules of Multi-SQL, giving the syntax and operating specifications. (4) We design a query optimization for Multi-SQL. The queries can be decomposed into many sub-queries and then decomposed into many topological sequence layers during execution. (5) We compare Multi-SQL with our federal database’s(A combination of single-model databases) native query languages under a social media scenario. Experimental results demonstrate that optimized Multi-SQL is the best with a 22% percent improvement compared with the native Multi-SQL and query languages of single-model databases in the complex multi-queries

The rest of this paper is organized as follows. We introduce the architecture overview of Multi-SQL, describe the design ideas and macro model in Section 2. In Section 3, we use a multi-modal example to elaborate on the use of Multi-SQL. In Section 4, we carry out a detailed and standardized definition for the data model of Multi-SQL. The specific syntax definition and operation definition of Multi-SQL are provided in Section 5, and the process strategy of Multi-SQL is provided in Section 6. In Section 7, we show the superiority of Multi-SQL through comparative experiments. Finally, we draw a conclusion of this paper in Section 8.

2 The Architecture of Multi-SQL

Multi-SQL is a declarative query language, and it provides functions for querying and modifying multi-model data and specifying schema definitions. Multi-SQL considers that different data models are interconnected and interdependent in practice, but they are independent in storage. Thus, from a multi-model perspective, each data model needs to be treated equally with independent definitions. In order to ensure flexibility, Multi-SQL is easy to expand, and for ease of use, some statements are similar to SQL. Figure 1 shows the architecture of Multi-SQL. In the system, Multi-SQL can be mutually transformed with multi-modal views, and they can all manipulate data of different data models. The data models are abstracted based on the data in the multi-model database.

Figure 1: The Structure of Multi-SQL

In order to manipulate multi-model data, Multi-SQL defines the factors of multi-model data, including multi-model database, multi-model entity objects, and multi-model views. Multi-SQL defines entity objects with typical data models, i.e. relational tables, document sets, graphs, and key-value tables as, corresponding to keywords RELATION, DOCUMENT, GRAPH, and KV.

Multi-SQL supports multiple operations of these data models. Basic data operations include using DELETE to remove data and entities, using UPDATE for updating properties, and INSERT for adding data. Multi-SQL defines atomic filtering operation and data connection operation for multi-model queries. Atomic filtering operation is to perform a filtering operation in a single data model at a time. The data connection operation acts based on atomic filtering operations using equivalent join, and the extension to other types of join is trivial. During the query process, Multi-SQL queries are decomposed into atomic filtering operations, and the returned results of each model are merged by data connection operations to obtain multi-model query results. Model conversion refers to the transformation of data storage format, which is a special operation of Multi-SQL. It is achieved by using the keyword TRANSFER, and the corresponding relationship needs to be established during the conversion process.

3 Multi-SQL By Example

Figure 2 shows an example containing four data models in a social media scenario. Through this scenario, we give an example of Multi-SQL in the single-model query, multi-model query, and model conversion.

(a) Graph.
(b) Document.
(c) Relation.
(d) Key-value.
Figure 2: The Structure of Multi-SQL

First, we define a multi-modal database BlogsSystem to store these social media data.

[CREATE DATABASE BlogsSystem]

Query Q1 returns the ids of male users who follow user 002. It is a single-model data query that only involves the graph data model. For a single-model data query, only one entity needs to be created and initialized. Like SQL, the pattern given in the first SELECT clause in Line 3 selects all users in q1. JOIN is a connection operation that connects multiple objects according to specific requirements or conditions. q1 represents the result obtained through the JOIN operation in Line 4. The objects of JOIN can be multi-model objects or other joins, or the results of atomic filtering operations. In this query, the objects of JOIN here are two SELECT statements in Line 5 and Line 6. To better express complex sentences, AS statement is used to replace complicated names. Through the condition of Line 7, JOIN obtains the male users who follow 002.

Q1 CREATE ENTITY GRAPH BlogGra INIT GRAPH BlogGra SELECT q1.s JOIN q1 (s = BlogGra.s) SELECT BlogGra.s AS a.s WHERE BlogGra.o= “Male” and BlogGra.p = “gender”, SELECT BlogGra.s AS b.s WHERE BlogGra.o= “002” and BlogGra.p = “follows” a.s=b.s

A multi-model data query is an expansion on the basis of a single-modal query. Q2 queries the number of logins in the past week of senior members who posted posts in May. Since Q2 involves three data models, CREATE data model entities including GRAPH, RELATION and DOCUMENT and initialize them separately (from Line 1 to Line 6). Multi-model queries can be split into multiple single-model queries and Q2 can be split into relation model query, document model query and graph model query. The results of the segmentation are as follows:
(1) SELECT Doc.authorid WHERE Doc.month = “May”
(2) SELECT Gra.s WHERE Gra.p=”rdf:type” and Gra.o=” Senior Member”
(3) SELECT device, count(device) as devicecount WHERE uid =[*] and logtime between [t] and [t-7d] GROUP BY device
In order to execute (3) in the relation model, (1) and (2) must be executed first in the document model and the graph model. Line 10-13 return the JOIN result of (1) and (2), among them (1) returns users who posted posts in May, and (2) returns users who are senior members. SELECT in Line 9 is the extraction from JOIN results and IN means that the complete result of JOIN is used. In Line 7, 8 and 14, (3) gets the number of times a user logged in to the device in the past seven days. Finally, execute (3) to calculate the number of logins in the past week of senior members who have posted posts in May.

Q2 CREATE ENTITY RELATION Rel CREATE ENTITY GRAPH Gra CREATE ENTITY DOCUMENT Doc INIT RELATION Rel INIT GRAPH Gra INIT DOCUMENT Doc SELECT Rel.device, count(Rel.device) as devicecount WHERE Rel.logtime between [t] and [t-7d] and Rel.uid IN (SELECT userrel.id IN JOIN userrel (id = Doc.authorid) SELECT Doc.authorid WHERE Doc.month = “May”, SELECT Gra.s WHERE Gra.p=”rdf:type” and Gra.o=” Senior Member” Doc.authorid = Gra.s) GROUP BY Rel.device

The special operation TRANSFER in Multi-SQL can convert data from one data model into another data model. In actual situations, if the DBA finds that a data set stored in a particular data model is complicated or takes a long time during the call, she can choose to change the data set’s data model. Q3 transforms the data in the document model into relation model. In Q3, first create and initialize a new relation entity to store the data transferred from the document model. TRANSFER in Line 3 provides some one-to-one mappings to convert document data into relational data. The form of these data is consistent or can be changed to a required data format.

Q3 CREATE ENTITY RELATION Doc2rel INIT RELATION Doc2rel TRANSFER Doc into Doc2rel (Doc2rel.idDoc.id,Doc2rel.authoridDoc.authorid,Doc2rel.topicDoc.topic, Doc2rel.dateDoc.date,Doc2rel.postsDoc.posts)

4 Data Model

In this section, we provide a formal specification of the data model, which is the theoretical basis of Multi-SQL. The formal specification can be used for reasoning to prove the rationality of Multi-SQL, and it is stated in a standardized way that Multi-SQL can support different operations of multiple data models and perform collaborative queries. Besides, when giving the formal specification, scalability is considered as much as possible to ensure compatibility when new data models appear. Throughout the description below, we use the notation conventions in Table 1.

Concept Notation Set of notation
Constants a, b, c -
Values v
Variables x, y ,
Attributes A, B, C , ,
Relation name r
Tuples t, s -
Property p
Node n
Table 1: Notation conventions.

4.1 Basic definition

The data models include the definition of values and the definition of basic operations. It is the basis for all data models.

4.1.1 Value

We assume two basic types for further definition: the integers , and the type of finite strings over a finite alphabet . The value set of Multi-SQL is . Assuming that the set is infinite, the value space is as follows:

  1. Elements of are values.

  2. Elements of are values.

  3. True, False and Null are values.

  4. An empty list is a value. If ,…, are values, then list[,…,] (m0) is a value.

  5. An empty map is a value. If ,…, are values and ,…, are distinct property keys, then map{:,…,:} (m0) is a value.

4.1.2 Operation

In this paper, we define a finite basic function family including string concatenation function, absolute value function, and other basic functions that can be applied to perform value operations. We also define some extensible basic operations, which can be executed on single-data model. Basic operations include equal to (), greater than (), less than (), minimum (MIN), maximum (MAX), count the number (COUNT), average (AVG), sum (SUM), EXCEPT and so on. The functions of these operations are similar to those in SQL.

Multi-SQL has good extensibility, and the value and operations can be expanded by defining new basic types and basic functions. All reasoning and deduction in Multi-SQL are generated based on the above value and functions.

4.2 Relation model

For a multi-model database, an independent, complete, and streamlined data model can better deal with complicated relationships and storage issues. We design a relation model based on the most primitive requirements of relational data. This section introduces the relation model from two aspects: the structure of the model and the operations.

4.2.1 Structure

The relation structure rules are defined as follows in which is the attribute name, and is the domain corresponding to the attribute.

=(,,),(,,) (n=0,1,2…)

For example, given one tuple , i n, and is the th line in .

=(, , ) (n=0,1,2…)

4.2.2 Operation

There are two primary considerations for operations on the relational data model: the operations on a single relational object and the operations on multiple relational objects.

For single object operation:

(R) represents SELECT, which means choosing the tuples that meet the given conditions in relation . is the notation of selection criteria, and its value is True or False. The general expression of is , where is a comparison operator.

(R) represents PROJECTION, which means extracting specific attributes in relation . represents a collection of attributes that need to be projected.

For multiple objects operation:

JOIN RS is an operation that first goes through the Cartesian product and then selects according to conditions. LEFT JOIN returns all rows from the left table, even if there is no matching row in the right table. RIGHT JOIN and LEFT JOIN are the same but in opposite direction.

Union R+S is used to merge the result sets of two or more statements.

4.3 Key-value model

For the key-value model, its structure is simple, and it can stably support the storage and efficient query of massive data. Large amounts of data with simple relationships are suitable for storage in the key-value model.

4.3.1 Structure

The key-value structure is defined as follows:

= (KEY, VALUE),

For KEY, VALUE, ()= is a one-to-one mapping.

4.3.2 Operation

Similar to SELECT in relational operations, the main operation in key-value model is GET.

(KV) represents GET, which means getting tuples form the KV table under the condition .

4.4 Document model

We regard the document data as a list of ordered trees, which is composed of nodes and one-to-many relationships.

4.4.1 Structure

Where is the ordered schema, and is the domain list of document nodes.

4.4.2 Operation

There are two special operations in document data, single-path-query (SPQ) and complex-structural-query (CSQ). We regard both SPQ and CSQ as a collection of MATCH operations.

MATCH (D), where is a boolean expression. The result of (D) is a list of documents which satisfy the boolean expression.

Example 1

: In social media, the user’s posts information can be stored using the document model. Given a certain document dataset with a ordered schema in Figure 3(a). For the single-path-query marked in red in Figure 3(b), it can be displayed as (D). (D) represents the complex-structural-query shown in blue in Figure 3(c).

(a) Document D.
(b) SPQ.
(c) CSQ.
Figure 3: Document model

4.5 Graph model

We set the graph model as a complex structure composed of nodes and many-to-many relationships. Compared with other data models, complex graphs perform better in storing network datasets[20].

4.5.1 Structure

, where is the collection of nodes, is the collection of edges, is the collection of properties and is the property edges.

4.5.2 Operation

The operations in graph model is quite flexible. Users could start a query from a node, edge or a property. Similar to the document model, all operations in graph are also composed of automatic matching.

Given a graph . (G) represents the filtering node operation, where is a boolean expression. Similarly, we define expressions for filtering edges, filtering properties and filtering property edges: (G), (G), (G).

Example: In Figure 4, if we want to query node , we can use operation (G). If we query the edge between n1 and n2, it is represented as (G). Similarly, the query of properties and property edges is expressed as (G) and (G). For a longer path, we can use a list composed of these four operations.

Figure 4: Graph Model

5 Syntax and Semantics

The syntax of Multi-SQL includes two modules, one is the data definition module, and the other is the data manipulation module. The data definition language is responsible for the definition and initialization of data objects and views, and the data manipulation language is responsible for storing, deleting, modifying, and querying data.

5.1 Data definition language

As shown in Table 2, the keywords of the data definition language mainly include CREATE, RELATION, GRAPH, KV, DOCUMENT, KEY, SCHEMA, etc. For a data model object, we need to create and then initialize it. The initialization process is different for each modal. The relation modal needs to specify each column and the primary key of the table. The document model needs to formulate the document pattern. The column names and key-value relationships need to be specified in the KV model. The graph model does not need to set anything.

OBJECT CREATION CREATE TYPE MDNAME TYPE = ONE OF DOCUMENT, GRAPH, RELATION, KV MDNAME = DEFINED BY USER OBJECT INITIALIZATION INIT TYPE MDNAME WITH DATA SCHEMA DATA SCHEMA=COLUMN LIST KEY PRIMARY DOCSCHEMA KEY NAME, VALUE NAME VIEW CREATION CREATE VIEW VTYPE VIEWNAME AS Q VTYPE= ONE OF MULTI, SINGLE VIEWNAME = DEFINED BY USER

Table 2: Data definition language

5.2 Data manipulation language

Data manipulation language in Table 3 is mainly composed of data query statement, data insertion statement, data update statement, and data deletion statement.

QUERY Q = SELECT [DISTINCT] (Q) [WHERE CLAUSE] [GROUP BY CLAUSE] [ORDER BY CLAUSE] (Q)= LIST OF ATTRIBUTION ATTRIBUTION = KEY IN GRAPH (PARTIAL GRAPH) DOCUMENT (SUB DOCUMENT) RELATION (PARTIAL RELATION) KV (PARTIAL KV) Q JOIN UNION WHERE CLAUSE=SOME BOOLEAN EXPRESSION OF CONDITION CLAUSE CONDITION CLAUSE = OBJECT OR VAL OP OBJECT OR VAL OP = , , , , = LEFT RIGHT JOIN JNAME OPTION [CO-RELATION] OBJECT [OBJECT1 OBJECT2] WHERE CLAUSE CO-RELATION = LIST OF ATTRIBUTIONATTRIBUTION Q1 UINON [ALL] Q2 INSERTION INSERT ITYPE IOBJECT MULTIVAL VAL LIST ITYPE = ONE OF DOCUMENT, GRAPH, ,RELATIONKV IOBJECT = LIST OF MDNAME VAL LIST = LIST OF DOCUMENT GRAPH KV RELATION UPDATE UPDATE UTYPE IOBJECT UTYPE = ONE OF DOCUMENT, GRAPH, RELATION,KV IOBJECT = LIST OF MDNAME DELETION DELETE DTYPE IOBJECT IOBJECT MULTIVAL VAL LIST DTYPE = ONE OF DOCUMENT, GRAPH, RELATION,KV IOBJECT = LIST OF MDNAME VAL LIST = LIST OF DOCUMENT GRAPH KV RELATION TRANSFER TRANSFER Q1 INTO Q2 WITH CO-RELATION CO-RELATION = LIST OF ATTRIBUTIONATTRIBUTION

Table 3: Data manipulation language

Multi-SQL data query statement takes into account the characteristics of each model and all multi-modal queries can be composed by simple atomic filtering, join, and union operations. A query consists of objects and WHERE, GROUP BY, ORDER BY assertions recursively, and can be easily parsed into an execution tree.

The data insertion statement includes the insertion of multi-model data and single-model data. It is composed of three clauses: ITYPE is used to specify the type of data model inserted (relation, graph, KV, or document data model), IOBJECT is used to specify the inserted data object, and VAL LIST is used to specify the corresponding data. Similarly, data update and data deletion statements are also defined in the table.

With the growth of data volume and the change of computer application workload types, the original data model may be inadequate for the existing data access requirements. We design a model conversion statement that supports the conversion between models. The conversion object can be the data model itself or a query result that meets the constraint specification. The existing mainstream query languages like SQL-extend and CQL have not yet supported data model conversion.

6 Query Processing

This section introduces the approach of multi-model query processing, and gives a query optimization method based on equivalence class rule.

Multi-SQL query processing in Algorithm 1 is to continuously decompose complex queries through a query tree until it is decomposed into a combination of atomic filtering, join and union operations. At this point, we obtain a topological set of sorted operations, and the operations in this set are hierarchical. Only when the pre-operation of the topology is executed, the post-operation can be performed. Atomic filtering operation only involves a single model, which can be easily assigned to the corresponding data model for execution. Finally, the multi-model query results are obtained layer by layer according to the underlying data model’s execution results.

input : Topological sequence Q, constant , equivalence class rule E
output : Optimized query execution tree T
1 Query-Process (Q, , E)
2 T
3 while Q do
4       Option extract (Q)//Extract executable clauses in partially ordered set Q
5       Q Q - Option
6       while Option do
7             for item in Option do
8                   if Cost(item)  then
9                         e e + item
10                   end if
11                  
12             end for
13            T, Option T + e, Option - e
14             Option Optimize-SubQuery (Option, T, E)
15             if not e then
16                   T T + Option
17                   break
18             end if
19            
20       end while
21      
22 end while
return T
Algorithm 1 Query Optimization

In the above query processing process, an algorithm based on the equivalence class rule is used to optimize the query, as shown in Algorithm 2. Given a constant , the algorithm selects operations with cost less than from the optional operation set Option each time to form the current executable set e. The selected operations can be used to optimize the operations in the unselected operation set Option-e. The unselected operations can be optimized through user-defined equivalence class specifications. If an operation cannot be optimized, will be executed last. Finally, the query processing tree is generated. There are triple loops in this optimization algorithm, but the algorithm does not need to consume much time and resources. Since the clause size of a query could be treated as a constant, each iteration can be completed in a constant number of times.

input : Set to be optimized , optimize base , equivalent class rule
output : Result after optimization ROption
1 Optimize-SubQuery (Option, T, E)
2 ROption
3 for item in Option do
4       V item.Object.attribution()
5       for e in E do
6             if e V then
7                   for v in V do
8                         if IsBasic(v, T) then
9                               item item + extractBasic(v, T)
10                         end if
11                        
12                   end for
13                  
14             end if
15            
16       end for
17      ROption ROption + item
18 end for
return ROption
Algorithm 2 Optimize-SubQuery

Below, we explain the query process in detail through a simple query under the social media scenario: query users who registered earlier than 2012, logged in in the past week, and posted posts within half a year. When Multi-SQL executes the query, this query is processed into a multi-layer topology sequence. This topology sequence can be multi-layered, not limited to three layers in this example.

In the first-level topology sequence, this query is filtered into the following three sub-queries, Option={q1,q2,q3}:
q1: Query users with ”registerdate2012” in the relation model
q2: Query users with ”logtimet-7” in the relation model
q3: Query users with ”postdateT-365/2” in the document model

In the second-level topology sequence, Multi-SQL predicts the cost of each sub-query and then selects appropriate ones for execution:
Because Cost(q1), Cost(q2) and Cost(q3), q1 and q2 are executed first. Option={q3}, T={q1,q2}

In the third-level topology sequence, q3 can be optimized by the results of q1 and q2 based on the equivalence class to reduce query time:
Optimize-SubQuery (, , ), and then execute the optimized q3. Finally, the execution tree ={q1,q2,q3} is obtained.

7 Experiment

In order to verify the effectiveness of Multi-SQL, we implement the four basic data models and query processing of the language. Since there is no standard multi-model benchmark, we generate some multi-model data based on the popular social media scene[21] and design some multi-model queries for verify the efficiency of Multi-SQL.

7.1 Data Generation

Facebook is a typical social media scene, which contains both relational data like basic user information, document data like posts, and graph data like social networks. We use the this social media scenario as a prototype[21] to abstract the multi-model data model. As shown in Figure 5, the basic information of the user is stored as relation model, the posts posted by the user are stored as document model, the password is stored as KV database, and the social relationship between users is stored as graph model. During the experiment, the number of users, behaviors, posts, likes are 10,000 100,000 100,000 100,000. The graph data is abstracted into [Subject, Predicate, Object] form, and there are 600,000 pieces.

Figure 5: Social media data stored as four data models

7.2 Query Performance Test

The operating system used in the test is Centos7, and we use MongoDB111https://www.mongodb.com/, Cassandra222https://cassandra.apache.org/, PostgreSQL333https://www.postgresql.org/, MySQL444https://www.mysql.com/, and Neo4J555https://neo4j.com/ for the underlying storage of different data models.

We design two workloads in Table 4 and Table 5 to test Multi-SQL’s performance. For the queries in Workload 1, the data models involved are highly coupled. Therefore, when Workload 1 uses Multi-SQL for query, it will be decomposed into many sub-queries and decomposed into many layers of topological sequences during execution. These queries are flexible for optimization. The queries in workload 2 are the opposite, and the data models involved are low in coupling. For these queries, the optimization space in Multi-SQL is relatively small.

Finally, we compare the execution efficiency of queries using single-model databases’ query languages, native Multi-SQL and optimized Multi-SQL.

Query
1 Query the likes of users whose ages are between 20-30 and have posted posts on the date of 20090821.
2 Query the likes of users whose ages are between 35-45 and have posted posts on the date of 20080724.
3 Query the likes of users who have posted posts between 20190607-20190718, registered before 20090312, and logged in on 20101223.
4 Query the likes of users who have posted posts between 20190101-20190601, registered before 20050101, and logged in on 20081003.
Table 4: Workload 1
Query
1 Query the login statistics of female users who have followed user 04078 and posted posts with the topic ”School” between 20200701 and 20200707.
2 Query the general members who have followed user 04088, occupation is student and registered before 20110101.
3 Query the general members who have followed user 04172, occupation is artist and registered before 20120101.
4 Query the number of likes of senior users who have commented on posts with the topic ”Pneumonia”.
Table 5: Workload 2

7.3 Result

In the experiment, we use two workloads to test the performance of Multi-SQL. In the test, we compare the query time using single-model databases’ query languages, the Multi-SQL without query optimization, and the optimized Multi-SQL. The specific runtime results are shown in table 6.

Workload Single-model/ms Native Multi-SQL/ms Optimal Multi-SQL/ms
W1:Q1 351ms 337ms 262ms
W1:Q2 270ms 280ms 204ms
W1:Q3 437ms 327ms 241ms
W1:Q4 404ms 213ms 195 ms
W2:Q1 30ms 174ms 193ms
W2:Q2 2260ms 2179ms 1940ms
W2:Q3 2423ms 2618ms 2574ms
W2:Q4 402ms 329ms 333ms
Table 6: Execution Time Result
(a) Average result for Workload1.
(b) Average result for Workload2.
Figure 6:

In order to present the comparison more intuitively, we draw histograms for the average of the experimental results of Workload 1 and Workload 2 in Figure 6. Figure 6(a) hows that the performance of the optimized Multi-SQL is the best with a 22% percent improvement compared with the native Multi-SQL, while the single-model and native Multi-SQL have similar performance. It is because the more coupled the data being queried, the greater the room for optimization. As is illustrated in Figure 6(b), the single-model is slightly better than Multi-SQL. It is because these queries have no room for optimization, and we have not been able to embed Multi-SQL into the databases. When a query is executed, the query is converted into a native language recognized by the current underlying database. Therefore, there is little difference in execution speed between using Multi-SQL and the query language that comes with the database. In a word, Multi-SQL spends some time in query preprocessing and converting.

In general, the optimized Multi-SQL performs well when the data model is highly coupled, and the unoptimized Multi-SQL performs similarly to the single-model. But we believe that if Multi-SQL can be truly written into the database kernel, its query performance will be significantly improved.

8 Conclusion

The multi-model database is increasingly prevalent across a wide variety of application domains. There is a requirement for the query language to allow for multi-model operations to be expressed directly. As the first multi-model query language designed from a multi-model perspective, Multi-SQL is a solidly-established declarative query language with scalability, flexibility, and efficiency.

All the work, including the formal semantics and query optimization described in this paper, will help establish a data hub that fully supports data of all models and types. In future work, Multi-SQL will be defined more finely to meet increasing demands. At the same time, more new ideas will also be applied to improve query optimization.

References

  • [1] C. J. Date. A Guide to the SQL Standard, Second Edition. Addison- Wesley, 1989. ISBN 978-0-201-50209-1.
  • [2] P. T. Wood. Query languages for graph databases. SIGMOD Record, 41(1):50–60, 2012.
  • [3] Francis N, Andrés Taylor, Green A , et al. Cypher: An Evolving Query Language for Property Graphs[C]//SIGMOD. 2018.
  • [4] Alin Deutsch, Liying Sui, Victor Vianu. XML path lan-guage (XPath) Version 1.0. W3C recommendation, The World Wide Web Consortium[C]// Journal of Computer System Sciences. 1999.
  • [5] Hiroyuki Kato, Yasunori Ishihara, and Torsten Grust. 2017. DDO-Free XQuery. DBPL ’17. Association for Computing Machinery, New York, NY, USA, Article 4, 1–13.
  • [6] Gueni B , Abdessalem T , Cautis B , et al. Pruning Nested XQuery Queries[C]// CIKM 2008, Napa Valley, California, USA, October 26-30, 2008. ACM, 2008.
  • [7] van Rest O, Hong S, Kim J, et al. PGQL: a property graph query language[C]//Proceedings of the Fourth International Workshop on Graph Data Management Experiences and Systems. 2016: 1-6.
  • [8] Gog I, Schwarzkopf M, Crooks N, et al. Musketeer: all for one, one for all in data processing systems[C]//Proceedings of the Tenth European Conference on Computer Systems. 2015: 1-16.
  • [9] Lu J, Holubová I. Multi-model Data Management: What’s New and What’s Next?[J]. 2017.
  • [10] Lu J, Holubová I, Cautis B. Multi-model databases and tightly integrated polystores: Current practices, comparisons, and open challenges[C]//Proceedings of the 27th ACM International Conference on Information and Knowledge Management. 2018: 2301-2302.
  • [11] Agrawal D, Chawla S, Contreras-Rojas B, et al. RHEEM: enabling cross-platform data processing: may the big data be with you![J]. Proceedings of the VLDB Endowment, 2018, 11(11): 1414-1427.
  • [12] Gadepally V, Chen P, Duggan J, et al. The bigdawg polystore system and architecture[C]//2016 IEEE High Performance Extreme Computing Conference (HPEC). IEEE, 2016: 1-6.
  • [13] Lu J, Liu Z H, Xu P, et al. UDBMS: road to unification for multi-model data management[C]//International Conference on Conceptual Modeling. Springer, Cham, 2018: 285-294.
  • [14] Tesoriero C. Getting started with orientDB[M]. Packt Publishing Ltd, 2013.
  • [15] Klophaus R. Riak core: Building distributed applications without shared state[M]//ACM SIGPLAN Commercial Users of Functional Programming. 2010: 1-1.
  • [16] Majedi M, Ghazinour K, Chinaei A H, et al. SQL privacy model for social networks[C]//2009 International Conference on Advances in Social Network Analysis and Mining. IEEE, 2009: 369-370.
  • [17] Bourhis P, Reutter J L, Suárez F, et al. JSON: data model, query languages and schema specification[C]//Proceedings of the 36th ACM SIGMOD-SIGACT-SIGAI symposium on principles of database systems. 2017: 123-135.
  • [18] https://docs.datastax.com/en/archived/cql/3.3/cql/cqlIntro.html
  • [19] https://www.arangodb.com/docs/stable/
  • [20] Armstrong T G, Ponnekanti V, Borthakur D, et al. LinkBench: a database benchmark based on the Facebook social graph[C]//Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data. 2013: 1185-1196.
  • [21] Armstrong T G, Ponnekanti V, Borthakur D, et al. LinkBench: a database benchmark based on the Facebook social graph[C]//Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data. 2013: 1185-1196.