Querying collections of tree-structured records in the presence of within-record referential constraints

02/12/2021 ∙ by Foto N. Afrati, et al.

In this paper, we consider a tree-structured data model used in many commercial databases like Dremel, F1, JSON stores. We define identity and referential constraints within each tree-structured record. The query language is a variant of SQL and flattening is used as an evaluation mechanism. We investigate querying in the presence of these constraints, and point out the challenges that arise from taking them into account during query evaluation.

1 Introduction

Systems that efficiently analyze complex data (e.g., graph or hierarchical data) are ubiquitous. Such systems include document databases (e.g., MongoDB [8]) and systems combining a tree-structured logical data model with columnar storage, such as F1 [30], Dremel [25], and its open-source alternatives Apache Drill [1] and Parquet [2].

Identity and referential constraints (a.k.a. keys and foreign keys) have been extensively studied in the context of relational databases and, later on, in the context of the XML data model [9, 16, 19], as well as for graphs [18] and RDF data [11, 23]. Recently, JSON data models have been proposed and key constraints analyzed [27, 7].

In this paper, we consider a tree-record data model, such as the one proposed in [6], for representing collections of tree-records. We define identity and referential constraints within each tree-structured record. Unlike relational databases and XML data, where such constraints are used for validating the data, in this work we take advantage of them in query answering. These constraints are used to improve query processing and also provide a way to significantly compress the data. We consider an SQL-like query language (such as the one used in Dremel/BigQuery, F1, Drill), and we investigate querying in the presence of such a set of constraints. To the best of our knowledge, this is the first work investigating the problem of querying collections of tree-records with SQL in the presence of within-record constraints.

There are many technical challenges that need to be addressed. The definition of keys is based on equality of values. In the relational model this is straightforward, but in tree-structured data we first need to define precisely when two instances of the schema are equal. We do that in a manner similar to such definitions for other tree-structured data such as XML, e.g., [10]. We define the semantics of identity and reference constraints on the data by using flattening, and show how to compute a query on the flattened data, which consists of tables like relational data. However, traditional flattening is not able to handle keys that are referred to by more than one foreign key. This is a new challenge that we address for the first time in this paper. Another challenge lies in defining keys and foreign keys so that inconsistencies do not appear; we discuss this and point to further research. Thus, our contributions are:

  1. We define within-record key and foreign key constraints for the tree-structured data model used in many commercial databases like Dremel, F1, JSON (Section 3).

  2. We show when these concepts are well-defined (Section 4.3).

  3. We show how to use flattening to answer SQL-like queries, such as the ones used in Dremel/BigQuery (Section 4). We also introduce the concept of relative flattening which is a part of the flattened data corresponding to a subtree of the schema.

2 Defining the data model

In this section, we present a theoretical data model, called tree-record data model, that organizes elements of data into collections of tree-structured records. This model is mainly inspired by the Dremel data model [25, 6, 5], but it applies to document-oriented (i.e., XML, JSON) data stores (e.g., ElasticSearch, MongoDB) and relational databases supporting hierarchical data types (e.g., JSON type in PostgreSQL, MySQL and Drill, struct type in Hive).

The tree-record data model considers collections of records (or, tables) conforming to a predefined schema. Here, we consider a nested, tree-structured schema, called tree-schema (schema, for short), which uses the conventional, primitive data types (such as integer, string, float, Boolean, etc.) to store the data and a complex data type, called group type, to define the relationships between the data values and describe nested data structures. (Footnote 1: Group type is mainly used by legacy Dremel syntax and was recently replaced by the concept struct in Google BigQuery.)

A group type (or simply group) is a complex data type defined by an ordered list of items (also called attributes or fields) of unique names which are associated with a data type, either primitive type or group type. In fact, group type could be thought of as an element in XML, an object in JSON, a Dictionary in Python, or as a Struct type in other data management systems (e.g., SparkSQL, Hive).

We use a multiplicity constraint (also called repetition) to specify the number of times a field is repeated within a group. Formally, the repetition constraint for a field can take one of the following values with the corresponding annotation:

  • required: is mandatory, and there is no annotation,

  • optional: is optional (i.e., appears $0$ or $1$ times) and is labeled by $?$,

  • repeated: appears $0$ or more times and is labeled by $*$,

  • required and repeated: appears $1$ or more times and is labeled by $+$.

The set of repetition types is thus {required, optional, repeated, required and repeated}. Note that a repeated field can be thought of as an array of elements (repeated types) in JSON structures. We now represent the tree-schema [6] by a tree, as follows.

Definition (tree-schema). A tree-schema $S$ of a table $T$ is a tree with labeled nodes such that

  • each non-leaf node (called intermediate node) is a group and its children are its attributes,

  • each leaf node is associated with a primitive data type,

  • each node (either intermediate or leaf) is associated with one of the repetition constraints listed above, and

  • the root node is labeled by the name of the table.
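The conditions of this definition can be sketched in code. The following is a minimal illustration only, not the representation used by Dremel or any other system: the class names, the annotation symbols chosen for the enum values, and the `is_valid_tree_schema` helper are assumptions introduced here, and the field names are borrowed from the queries that appear later in the text.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Repetition(Enum):
    # Annotation symbols are an assumption of this sketch.
    REQUIRED = ""
    OPTIONAL = "?"
    REPEATED = "*"
    REQUIRED_REPEATED = "+"


@dataclass
class SchemaNode:
    name: str
    repetition: Repetition = Repetition.REQUIRED
    primitive_type: Optional[str] = None  # set only on leaves
    children: List["SchemaNode"] = field(default_factory=list)

    def is_leaf(self) -> bool:
        return not self.children


def is_valid_tree_schema(node: SchemaNode) -> bool:
    """Leaves carry a primitive type; groups carry none and have uniquely named children."""
    if node.is_leaf():
        return node.primitive_type is not None
    if node.primitive_type is not None:
        return False
    names = [child.name for child in node.children]
    if len(names) != len(set(names)):
        return False
    return all(is_valid_tree_schema(child) for child in node.children)


# A hypothetical fragment of a Booking schema: the root is labeled by the table name.
booking = SchemaNode("Booking", children=[
    SchemaNode("Voucher", primitive_type="string"),
    SchemaNode("Service", Repetition.REQUIRED_REPEATED, children=[
        SchemaNode("Type", primitive_type="string"),
    ]),
])
```

Keeping the repetition constraint on every node, rather than only on groups, mirrors the third condition of the definition.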

For the sake of simplicity, we hide the primitive data types of leaves in the graphical representations of the schemas. As for the intermediate nodes, since the type of each such node $n$ in $S$ is group, the type of $n$ typically includes the structure defined by the subtree rooted at $n$. Each non-required node of the schema is called an annotated node. When we de-annotate a node, we remove the repetition symbol from its label, if it is annotated; otherwise, we keep its label as is. Each node of a schema carries a label. For nodes $n_1$, $n_2$ of a schema $S$ such that $n_2$ is a descendant of $n_1$, we refer to the path between $n_1$ and $n_2$ in $S$.

Definition (reachability path). The path between the root and a node $n$ of a schema $S$ is called the reachability path of $n$.

We refer to each node in a schema through its reachability path. We omit a prefix in the reachability path of a node if we can still identify the node through the remaining path in the schema.
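As a rough illustration of reachability paths and of omitting a prefix when the remainder still identifies the node, consider the sketch below. The nested-dictionary schema encoding, the function names, and the `Passenger` and `Id` fields are hypothetical choices made for this example.

```python
def reachability_paths(schema, prefix=()):
    """List the reachability path (a tuple of labels) of every node.

    `schema` maps a node label to a dict of children, or to None for a leaf.
    """
    paths = []
    for name, children in schema.items():
        path = prefix + (name,)
        paths.append(path)
        if children:
            paths.extend(reachability_paths(children, path))
    return paths


def shortest_unambiguous(path, all_paths):
    """Drop the longest prefix such that the remaining suffix matches one node only."""
    for start in range(len(path) - 1, -1, -1):
        suffix = path[start:]
        matches = [p for p in all_paths if p[-len(suffix):] == suffix]
        if len(matches) == 1:
            return ".".join(suffix)
    return ".".join(path)


# Hypothetical schema fragment; annotations are left out for brevity.
SCHEMA = {
    "Booking": {
        "Voucher": None,
        "Service": {"Id": None, "Type": None},
        "Passenger": {"Id": None},
    }
}
PATHS = reachability_paths(SCHEMA)
```

Here `Type` alone identifies its node, whereas `Id` occurs under two groups and therefore needs the longer form `Service.Id` or `Passenger.Id`.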

Intuitively, the leaves in a tree-schema represent the "data-keepers", while all the intermediate nodes describe the relationships between the data stored in the leaves. To see this, let us define an instance of a tree-schema $S$. Given a subtree $s$ of $S$, the dummy subtree of $s$ is the tree constructed from $s$ by de-annotating all the annotated nodes of $s$ and adding to each leaf a single child which is labeled by the NULL value. In addition, the de-annotated copy of $s$ is the tree produced by de-annotating only the root of $s$.

Definition (tree-instance of a given schema). Let $t$ be a tree constructed from $S$ by replacing the annotated nodes of $S$, top-down, as follows:

  • for each repeated node $n$ of $S$, we replace the subtree rooted at $n$ with either a dummy subtree, or $k$ copies of the subtree with de-annotated root, where $k \geq 1$,

  • for each optional node $n$ of $S$, we replace the subtree rooted at $n$ with either a dummy subtree or a single copy of the subtree with de-annotated root,

  • for each required and repeated node $n$ of $S$, we replace the subtree rooted at $n$ with $k$ copies of the subtree with de-annotated root, where $k \geq 1$.

Then, for each non-NULL leaf $l$ of $t$, we add to $l$ a single child which is labeled by a value that matches the primitive type of $l$. The tree $t$ is a tree-record of $S$. An instance of $S$, called a tree-instance, is a collection of tree-records (not necessarily a set).
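The multiplicity rules behind this definition can be checked mechanically. The sketch below assumes a simplified encoding that is not taken from the paper: a record is a nested dictionary, repeated fields hold lists, and an absent or `None` field stands for a dummy subtree. The schema fragment and the `conforms` helper are hypothetical, and primitive types are not checked.

```python
def occurrences(value):
    """Instances of a field inside one parent instance (None means a dummy subtree)."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


ALLOWED = {
    "required": lambda k: k == 1,
    "optional": lambda k: k in (0, 1),
    "repeated": lambda k: k >= 0,
    "required_repeated": lambda k: k >= 1,
}


def conforms(record, schema):
    """Check the repetition constraint of every field, recursively."""
    for name, (repetition, children) in schema.items():
        value = record.get(name)
        if repetition in ("required", "optional") and isinstance(value, list):
            return False  # non-repeated fields hold at most one instance
        instances = occurrences(value)
        if not ALLOWED[repetition](len(instances)):
            return False
        if isinstance(children, dict):
            if not all(isinstance(i, dict) and conforms(i, children) for i in instances):
                return False
    return True


# Hypothetical schema fragment: field name -> (repetition, children or primitive type).
BOOKING_SCHEMA = {
    "Voucher": ("required", "string"),
    "Operator": ("optional", {"Name": ("required", "string")}),
    "Service": ("required_repeated", {"Type": ("required", "string")}),
}

RECORD = {"Voucher": "V1", "Service": [{"Type": "transfer"}, {"Type": "excursions"}]}
```

In this encoding a missing `Operator` is accepted because the field is optional, while an empty `Service` list is rejected because the field must appear at least once.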

Consider the table Booking with schema depicted in Figure 1. At this stage, we ignore the additional labeled edges, which will be defined in the next section. The Booking table stores data related to reservations; each record in the table represents a single reservation. As we see in the schema, the Booking group includes a repeated and required field, Service, which means that each booking-record includes one or more services booked by the customer. The field describing the service type is a mandatory field (i.e., required) and takes values from the set {accommodation, transfers, excursions}. The groups could include additional fields, such as the date the reservation was booked and the start and end dates of the service, which are omitted here due to space limitations. Figure 2 illustrates a tree-record of the schema.

In the figures, all the dummy subtrees are omitted. The reachability path of a node $n$ in a tree-record is similarly defined as the path from the root of the tree-record to $n$.

In this paragraph, we define the notion of instantiation, using multiset rather than set semantics. Let $S$ be a schema and $t$ be a tree-record in an instance of $S$. Suppose that each node in both $S$ and $t$ has a unique virtual id. Since each node of $S$ is replaced by one or more nodes in $t$, there is at least one mapping $\mu$, called an instantiation, from the node ids of $S$ to the node ids of $t$, such that, ignoring the annotations in $S$, both the labels and the reachability paths of the mapped nodes match. (Footnote 2: Note that the reachability path of a node is given in terms of labels.) The subtree of $t$ which is rooted at the node $\mu(n)$ is an instance of the node $n$, where $n$ is a node of $S$. If $n$ is a leaf, an instance of $n$ is the single-value child of $\mu(n)$.

We say that two subtrees $t_1$, $t_2$ of a tree-record $t$ are isomorphic if there is a bijective mapping from the nodes of $t_1$ to the nodes of $t_2$ such that the labels of the mapped nodes match. We say that $t_1$ and $t_2$ are equal, denoted $t_1 = t_2$, if they are isomorphic and the ids of the mapped nodes are equal.
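The difference between isomorphic and equal subtrees can be made concrete with a small sketch. It assumes that the bijection also preserves the parent-child structure and that sibling order is irrelevant; both are assumptions of this example, as are the `Node` class and the sample labels.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    node_id: int
    label: str
    children: List["Node"] = field(default_factory=list)


def shape(node):
    """Canonical form built from labels only; sibling order is ignored."""
    return (node.label, tuple(sorted(shape(c) for c in node.children)))


def shape_with_ids(node):
    """Canonical form that also records the virtual node ids."""
    return (node.node_id, node.label,
            tuple(sorted(shape_with_ids(c) for c in node.children)))


def isomorphic(a, b):
    return shape(a) == shape(b)


def equal(a, b):
    return shape_with_ids(a) == shape_with_ids(b)


# Two distinct subtrees that carry the same labels and the same value.
p1 = Node(1, "Passenger", [Node(2, "Id", [Node(3, "7")])])
p2 = Node(4, "Passenger", [Node(5, "Id", [Node(6, "7")])])
```

The subtrees `p1` and `p2` are isomorphic but not equal, since their node ids differ; a subtree is equal only to itself.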

3 Within-record constraints

In this section, we define identity and referential constraints that hold on each tree-record.

Initially, we analyze the intuition behind the within-record constraints and, specifically, focus on identity and uniqueness constraints. In conventional databases, we use identity (and uniqueness) constraints (primary keys and unique constraints) to specify that the values of certain columns are unique across all the records of a table instance. Since the tree-record data model allows repetition of values within each record, we might have fields that uniquely identify other fields (or subtrees) in each record, but not the record itself [10]. In Example 2, each service has an identifier which is unique for each service within a reservation, but not unique across all the reservations in the Booking table. To support this type of constraint in a tree-record data model, we define the concept of identity constraint with respect to a group.

(identity constraint) Let be a tree-schema, be a tree-instance of , and be an intermediate node of . An identity constraint with respect to is an expression of the form , where the node is a descendant of , such that and all the descendants of in are required. Suppose that is the lowest repeated ancestor of and is the parent of . We say that the constraint is satisfied in if the following is true for each : for each instance of in , and for every two instances and of in , we have that if , then , where , are instances of in , , respectively. If is the root of the tree then .

The identity constraint is similar to the concept of relative key defined for XML documents in [10]. Intuitively, if the constraint is satisfied in then for each tree-record in , all the instances of are unique in each subtree of rooted at an instance of the parent of . In such a case we also say that the field is an identifier of . The group is also called the range group of the identifier . If , where is the root of , then all the instances of are unique in each tree-record. According to the previous definition, might be a non-repeated field. If is a child of , then is always unique in the subtrees rooted at the instances of the parent of ; since for each instance of , there is a single instance of . In essence, the identifier could be thought of as the primary key of either the upper repeated descendant of the group , or the group itself, in each tree-record.

In the tree representation of a schema, we annotate the identifiers using the symbol . In particular, supposing a schema and , where , are groups of , the -node is labeled as . We also use a special, dotted edge , called identity edge, to illustrate the range group of the identifier ; if the range group of is its parent we omit such an edge, for simplicity.

Figure 1: Booking Schema - Tree-record model with references

Following the Example 2, each reservation-record also includes a list of passengers which is given by the required and repeated field . The includes 3 fields , , , where the last one is optional for each passenger instance. The field is mandatory for each passenger and uniquely identifies the name and the location of each passenger; hence, the is an identifier of the and the corresponding identity constraint is given as follows: . Note here that the range group of the is its parent; hence, we ignore the corresponding identity edge.

To see the impact of the range group, let us compare the following two constraints: and . The field in the former case (i.e., the one illustrated in Figure 1) is a composite identifier of its parent and consists of two location ids; i.e., the combination of From and To locations uniquely identifies the transfer instances within each service, but not across all the services of the booking. On the other hand, setting the range group of the to (i.e., the latter constraint), the combination of From and To locations are unique across all the transfer services in each booking-record.
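The effect of the scope in which an identifier must be unique can be illustrated with a simplified check. This sketch does not implement the formal definition: it only tests that the identifier values reached inside each scope instance are pairwise distinct. The record encoding (nested dictionaries and lists), the helper names, and the `Transfer`, `Passenger` and `Id` fields are assumptions of the example.

```python
import json


def instances_at(node, path):
    """All instances reached from `node` by following `path` (a list of field names)."""
    if not path:
        return [node]
    value = node.get(path[0]) if isinstance(node, dict) else None
    if value is None:
        return []
    children = value if isinstance(value, list) else [value]
    found = []
    for child in children:
        found.extend(instances_at(child, path[1:]))
    return found


def satisfies_identity(record, scope_path, id_path):
    """Identifier instances must be pairwise distinct inside every scope instance."""
    for scope in instances_at(record, scope_path):
        seen = set()
        for value in instances_at(scope, id_path):
            key = json.dumps(value, sort_keys=True)  # also handles composite identifiers
            if key in seen:
                return False
            seen.add(key)
    return True


# Hypothetical record: two services, each with one transfer over the same route.
BOOKING = {
    "Voucher": "V1",
    "Passenger": [{"Id": 1}, {"Id": 2}],
    "Service": [
        {"Id": 1, "Transfer": [{"Route": {"From_Location_id": 1, "To_Location_id": 2}}]},
        {"Id": 2, "Transfer": [{"Route": {"From_Location_id": 1, "To_Location_id": 2}}]},
    ],
}
```

With the scope set to each service, the route is a valid composite identifier in this record; with the scope widened to the whole booking, the same route appears twice and the check fails, which is the contrast described above.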

We now define the concept of referential constraint (or, simply reference), which intuitively links the values of two fields. In essence, the concept of reference is similar to the foreign key in relational databases, but, here, it is applied within each record.

(referential constraint) Let , and be nodes of a tree-schema such that is an identifier of , is not a descendant of , and , have the same data type. A referential constraint is an expression of the form . A tree-instance of satisfies the constraint , if for each tree-record the following is true:
For each instance of the lowest common ancestor (LCA) of and in , the following happens: each instance of in is isomorphic to an instance of in .

If is a leaf of , then the constraint is called simple. We say that the node , called referrer, refers to the node which is called referent. To represent a reference constraint in a tree-schema , we add to a special (dashed) edge , called reference edge, which is labeled by . Let be the set of identity and referential constraints over . Consider now the tree given by (1) ignoring all the reference and identity edges, and (2) de-annotating the identifier nodes. We say that a collection is a tree-instance of the tree-schema in the presence of if is a tree-instance of and satisfies all the constraints in .
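For simple references, satisfaction amounts to a containment test inside each instance of the lowest common ancestor. The sketch below is a hedged illustration over nested dictionaries and lists, where isomorphism of leaf instances reduces to value equality; the helper names and the `Passenger_id` field are hypothetical.

```python
import json


def instances_at(node, path):
    """All instances reached from `node` by following `path` (a list of field names)."""
    if not path:
        return [node]
    value = node.get(path[0]) if isinstance(node, dict) else None
    if value is None:
        return []
    children = value if isinstance(value, list) else [value]
    found = []
    for child in children:
        found.extend(instances_at(child, path[1:]))
    return found


def satisfies_reference(record, lca_path, referrer_path, identifier_path):
    """Inside each LCA instance, every referrer value must match some identifier value."""
    for scope in instances_at(record, lca_path):
        available = {json.dumps(v, sort_keys=True)
                     for v in instances_at(scope, identifier_path)}
        for value in instances_at(scope, referrer_path):
            if json.dumps(value, sort_keys=True) not in available:
                return False
    return True


# Hypothetical record: services list the ids of the passengers taking them.
BOOKING = {
    "Passenger": [{"Id": 1}, {"Id": 2}],
    "Service": [
        {"Id": 1, "Passenger_id": [1]},
        {"Id": 2, "Passenger_id": [1, 2]},
    ],
}
```

In this example the lowest common ancestor is the record root, so `lca_path` is empty; a service that listed passenger id 3 would violate the constraint.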

Figure 2: Booking instance - Tree-record with references

Continuing the Example 3, we can see that the schema depicted in Figure 1 includes two referrers of the , and , specifying the passengers that booked each service and the passengers taking each transfer, respectively. Note that in the previous paths, we omit the prefix as we already mentioned earlier. Furthermore, the is a referent in four references defined. Notice also that the group is a composite identifier and consists of two fields that both refer to the identifier.

Consider now the tree-record of which is depicted in Figure 2. It is easy to see that this reservation includes 3 services booked for two passengers. The first service is booked for the first passenger, while the services with ids 2 and 3 are taken by both passengers. In each service, there is a list of passenger ids representing the passengers taking each service. Those fields refer to the corresponding passengers in the passenger list of the booking; appropriate reference edges illustrate the references.

4 Querying tree-structured data

In this section, we investigate querying tree-structured tables in the presence of identity and referential constraints. Although navigation languages (e.g., XPath [3], XQuery [4], JSONPath [21]) are mainly used to query tree-structured data, here, we focus on a SQL-like query language. In particular, we use the Select-From-Where-GroupBy (SFWG) expressions used in Dremel [25, 6] to query tables defined through a tree-schema. We assume that each field used in SFWG queries is a leaf node in the tree-schema and is given through its reachability path. We refer to such a query language as Tree-SQL. Typically, we focus on filtering, projection and grouping (and aggregation) operations over the tables. (Footnote 3: We do not consider joins, recursion, nested queries and within-aggregations [25], as well as operations that are used to build a tree-like structure at query-time, or as a result of the query (e.g., the -like functions in PostgreSQL).) Here, we consider only simple references. (Footnote 4: Querying schemas having references to intermediate nodes is considered a topic for future investigation.)

A query determines a mapping from tree-structured data model to relational model; i.e., it unnests the tree-structured data and transforms the tree-records into tuples. To analyze the semantics of a query in more detail, we initially ignore the references and identifiers. Let us consider, for example, the following simple query over the table with schema depicted in Figure 1:

    SELECT Voucher, Destination, Operator.Name
    FROM Booking
    WHERE Operator.Country = 'GE';

When is applied on an instance of it results a relation, also denoted , including a single tuple for each tree-record in such that the instance of the field in is ’GE’. Now, each tuple in includes the voucher of the booking, the destination and the name of the operator (if it exists - otherwise, the -value). For example, if includes the tree-record illustrated in Figure 2, then includes the tuple (, , ).

In the previous example, we can see that the fields used in both the SELECT and WHERE clauses do not have any repeated field in their reachability path. Querying the instances of such fields is similar to querying a relation consisting of a column for each field. The tuples are constructed by assigning the single value of each field, in each record, to the corresponding column. The challenge arises when a repeated node exists in the reachability path of a field used in the query, since such a field might have multiple instances in each tree-record. To formally define the query semantics and handle repetition, we use the concept of flattening [6], which is discussed in detail in the next section.

4.1 Flattening nested data

In this section, we analyze the flattening operation applied on tree-structured data. Flattening is a mapping applied on a (tree-structured) table and translates the table to a relation. By defining such a mapping, the semantics of Tree-SQL is given by the conventional SQL semantics over the flattened relation (i.e., the result of the flattening over the table). Initially, we consider a tree-schema without referential and identity constraints. The presence of constraints is discussed in the next section.

Let $S$ be a tree-schema of a table $T$ and $D$ be an instance of $S$, such that there is no reference defined in $S$. Suppose also that $l_1, \dots, l_m$ are the leaves of $S$. The flattened relation of $D$ is the relation given by the multiset $\{(\mu(l_1), \dots, \mu(l_m)) \mid \mu$ is an instantiation from the nodes of $S$ to the nodes of a tree-record $t \in D\}$, where each $\mu(l_i)$ stands for the value of the corresponding leaf instance and, for each pair $(l_i, l_j)$, $\mu(l_i)$ and $\mu(l_j)$ belong to the same instance of the lowest common ancestor of $l_i$ and $l_j$ in $S$. Considering now a query $Q$ over $S$ and an instance $D$ of $S$, we say that $Q$ is evaluated using full flattening if its result is given by evaluating $Q$ over the flattened relation of $D$. It is worth noting here that if $S$ does not have any repeated field then the flattened relation of $D$ includes $|D|$ tuples; otherwise, each record in $D$ can produce multiple tuples during flattening.
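Full flattening can be sketched as a recursive cross product over the fields of each group. This is a minimal illustration under assumed encodings (schema as nested dictionaries with `None` for leaves, records as nested dictionaries with lists for repeated fields, `None` rows standing for dummy subtrees); it is not the algorithm of Dremel or of any particular engine, and the sample schema is hypothetical.

```python
from itertools import product


def flatten(schema, instance, prefix=""):
    """Return one row (dict: leaf path -> value) per combination of leaf instances."""
    per_field = []
    for name, children in schema.items():
        path = prefix + name
        value = instance.get(name) if instance is not None else None
        occs = value if isinstance(value, list) else ([] if value is None else [value])
        if children is None:
            # Leaf: one candidate per occurrence, or a single NULL if absent.
            rows = [{path: v} for v in occs] or [{path: None}]
        else:
            # Group: concatenate the rows of every occurrence of the group.
            rows = [r for occ in occs for r in flatten(children, occ, path + ".")]
            if not rows:
                rows = flatten(children, None, path + ".")  # dummy subtree
        per_field.append(rows)
    result = []
    for combo in product(*per_field):  # siblings combine as a cross product
        row = {}
        for part in combo:
            row.update(part)
        result.append(row)
    return result


SCHEMA = {"Voucher": None, "Service": {"Type": None}, "Passenger": {"Id": None}}
RECORD = {
    "Voucher": "V1",
    "Service": [{"Type": "transfer"}, {"Type": "excursions"}],
    "Passenger": [{"Id": 1}, {"Id": 2}, {"Id": 3}],
}
ROWS = flatten(SCHEMA, RECORD)
```

The single record above yields six rows, one per pair of a service and a passenger, which shows how one tree-record can produce multiple tuples.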

Figure 3: Explaining full and relative flattening - (a) Schema, (b) instance

Let be a table with schema depicted in Figure 3(a), and instance including only the tree-record depicted in Figure 3(b). The flattened relation is , , . Consider now the query : , . typically applies a projection over the flattened relation; hence, it results in three tuples.

Looking at the Example 4.1, we can see that in has two instances, and . However, results three tuples, since the evaluation is affected by the repetition of . The issue becomes more misleading when we use aggregations. To avoid cases where the repetition of a field that is not used in the query has an impact on the query result, we define the concept of relative flattening. Let be a tree-schema of a table and is an instance of , such that there is not any reference defined in . Consider also a query over that uses a subset of the set of leaves of , and the tree-schema constructed from by removing all the nodes except the ones included in the reachability paths of the leaves in . Then, we say that a query is evaluated using relative flattening, denoted , if is given by evaluating over the relation: is an instantiation from the nodes of to the nodes of . Continuing the Example 4.1, we have that , . Consider a query over a tree-schema such that does not apply any aggregation. Then, for every instance of , the following are true:
(1) there is a tuple in if and only if there is a tuple in ,
(2) .
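The contrast between full and relative flattening can be reproduced on a small example. The sketch below prunes the schema to the leaves used by a query before flattening; the encodings, the helper names `flatten` and `prune`, and the sample schema are assumptions of this illustration.

```python
from itertools import product


def flatten(schema, instance, prefix=""):
    """Cross-product flattening; absent fields contribute a single NULL row."""
    parts = []
    for name, children in schema.items():
        path = prefix + name
        value = None if instance is None else instance.get(name)
        occs = value if isinstance(value, list) else ([] if value is None else [value])
        if children is None:
            rows = [{path: v} for v in occs] or [{path: None}]
        else:
            rows = ([r for o in occs for r in flatten(children, o, path + ".")]
                    or flatten(children, None, path + "."))
        parts.append(rows)
    return [{k: v for part in combo for k, v in part.items()}
            for combo in product(*parts)]


def prune(schema, used, prefix=""):
    """Keep only the reachability paths of the leaves named in `used`."""
    kept = {}
    for name, children in schema.items():
        path = prefix + name
        if children is None:
            if path in used:
                kept[name] = None
        else:
            sub = prune(children, used, path + ".")
            if sub:
                kept[name] = sub
    return kept


SCHEMA = {"Voucher": None, "Service": {"Type": None}, "Passenger": {"Id": None}}
RECORD = {
    "Voucher": "V1",
    "Service": [{"Type": "transfer"}, {"Type": "excursions"}],
    "Passenger": [{"Id": 1}, {"Id": 2}, {"Id": 3}],
}

USED = {"Service.Type"}  # leaves used by a query selecting only Service.Type
full = [row["Service.Type"] for row in flatten(SCHEMA, RECORD)]
relative = [row["Service.Type"] for row in flatten(prune(SCHEMA, USED), RECORD)]
```

Both evaluations return the same set of values, but full flattening repeats each value once per passenger, so a count over `full` gives 6 where a count over `relative` gives 2.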

In the next section, we give query semantics based on relative flattening; i.e., .

4.2 Navigating through references

In the previous section, we ignored the presence of constraints when we explained how to use flattening to answer an SQL-like query. Here, we show how we take advantage of the constraints.

Let us start our analysis by looking at the schema in Figure 1, and consider an instance of it. Suppose now that we want to find, for all the transfer services in the instance, their vouchers, along with the following transfer information: the vehicle of each transfer and the route expressed as a combination of From and To cities. Note that this query cannot be answered based on the query semantics defined in the previous section, since the city of each location does not belong to the same Route subtree. (Footnote 5: If the data is structured as in Figure 1, such a query cannot be answered using SFWG queries in Dremel, as well.) Taking into account, however, the identity and referential constraints of the schema, it is easy to see that intuitively such a query could be answered.

To see this, we can initially search for the voucher, vehicle, and ids of the From and To Locations for each transfer service within all the bookings. Then, for each id of the From and To Locations, we look at the corresponding Location list of the same record and identify the corresponding cities. To capture such cases and use the identity and referential constraints, we initially extend the notation of the Tree-SQL as follows. Apart from the reachability paths of the leaves that can be used in , and clauses, if there are constraints and , we can use paths of the form: , where the is the reachability path of , is a leaf which is a descendant of , and is the path from to . Hence, the query answering the previous question:
    SELECT Voucher, Vehicle, Route.From_Location_id.City,
           Route.To_Location_id.City
    FROM Booking
    WHERE Service.Type = 'transfer';
Intuitively, navigating through identity and reference edges, the leaves of become accessible from . For example, in the schema in Figure 1, the leaves of the group are accessible through both and .
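The two-step evaluation described above (collect the location ids of each transfer, then look them up in the Location list of the same record) can be sketched as follows. The nesting of `Transfer` under `Service`, the `Id` field of `Location`, and the function name are assumptions, since the full schema of Figure 1 is not reproduced in the text.

```python
def resolve_routes(booking):
    """Rows (Voucher, Vehicle, From city, To city) for the transfer services of one record."""
    # Step 2 lookup table: location identifier -> city, within this record only.
    cities = {loc["Id"]: loc["City"] for loc in booking.get("Location", [])}
    rows = []
    for service in booking.get("Service", []):
        if service.get("Type") != "transfer":
            continue
        for transfer in service.get("Transfer", []):
            route = transfer["Route"]
            # Step 1: the referrer values; the lookup follows the reference.
            rows.append((
                booking["Voucher"],
                transfer.get("Vehicle"),
                cities.get(route["From_Location_id"]),
                cities.get(route["To_Location_id"]),
            ))
    return rows


# Hypothetical record with one transfer service and one excursion.
BOOKING = {
    "Voucher": "V1",
    "Location": [{"Id": 1, "City": "Athens"}, {"Id": 2, "City": "Patras"}],
    "Service": [
        {"Type": "transfer", "Transfer": [
            {"Vehicle": "bus", "Route": {"From_Location_id": 1, "To_Location_id": 2}},
        ]},
        {"Type": "excursions"},
    ],
}
```

Because location identifiers are unique within the record, each referrer value resolves to a single city, so following the reference does not change the number of rows.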

To formally capture queries using references, we extend the relative flattening presented in the previous section as follows. Let be a tree-schema of a table , be an instance of , and be a set of identity and referential constraints satisfied in . Consider also a query over that uses a set of leaves of such that for each in there are constraints , in , where is an ancestor of . Let also be the tree-schema constructed from by keeping only the reachability paths of the leaves in , and be the tree-schema including only the reachability paths of , and the leaves of that are included in . Both and keep the node ids from . A query is evaluated in the presence of the constraints in , denoted , if is given by evaluating over the relation: is an instantiation from the nodes of to the nodes of , each is an instantiation from the nodes of to the nodes of , for every two , s.t. , we have that , and for each , we have that and .

Posing (defined above) on an instance including the tree-record depicted in Figure 2, we have two instantiations, each of which maps on a different instance of subtree. For each such instantiation, there is a single instantiation to a instance such that the referrer value equals the value. The result is:
, .
If we replace the field with the field , the reference is not used; hence, the result includes 6 tuples computed by combination of the 2 cities of From-location, and , and all the available instances of the .

4.3 Out-of-range references and cycles

In this section, we investigate well-defined references; i.e., whether it is clear which referent is referred to by each referrer in a tree-instance. We also discuss cases where the references define a cycle in the schema graph.

Figure 4: Out-of-range references and cycles

Consider the tree-schema depicted in Figure 4(a) and an instance of including the tree-record in Figure 4(b). As we can see, there are 2 referrers, and , which both refer to the identifier . The range group of is the node . Consider the queries and selecting only the fields and , respectively. Note that the result includes two times the value , while includes the value once. Let now and be the queries selecting the paths and , respectively. We can see that both and include the tuples and . Hence, when we use the reference from , the number of tuples in the result remains the same. Using however the reference from , the number of tuples in the result increases. This is because it is not clear which is the instance of that the instance of refers to. This property is captured by the following definition and proposition.

Let be a tree-schema, and , be two constraints over . Let be the lowest common ancestor of and . Then, we say that the reference is out-of-range if there is at least one repeated group on the path from to ; otherwise, the reference is within-range. Let be a reference over a tree-schema , and be a record of a tree-instance of satisfying the constraint. If is within-range, then for each instance of in , there is a single instance of in .

The property described in Proposition 4 is very important for defining referential constraints, since, when out-of-range constraints are set up, queries using the references might not compute the "expected" results.

By defining references in a tree-schema, cycles of references can appear. For example, consider the table with tree-schema illustrated in Figure 4(c). The table stores the projects () of each department (), along with the employees () of the department. Each project has a number of employees working on it, and each employee of the department might be accountable for () a list of projects. Hence, it is easy to see that the references defined between and subtrees form a cycle of references. One could ask for the projects of a certain category () which employ an employee who is accountable for a project of a different category, along with the name of the employee. To answer such a question, we need to navigate through both links. The query semantics defined in the previous section allow only a single use of a reference between two subtrees. Extending the semantics to support arbitrary navigation through the references is a topic for future work.
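Whether a set of references forms a cycle can be detected with a standard depth-first search over the graph whose nodes are subtrees and whose edges are references. The sketch below is a generic cycle check, not a procedure from the paper; the subtree names `Project`, `Employee` and `Department` are placeholders for the groups of Figure 4(c).

```python
def has_reference_cycle(references):
    """`references` is an iterable of (referrer subtree, referent subtree) pairs."""
    graph = {}
    for source, target in references:
        graph.setdefault(source, set()).add(target)
        graph.setdefault(target, set())

    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current DFS path, finished
    color = {node: WHITE for node in graph}

    def visit(node):
        color[node] = GREY
        for nxt in graph[node]:
            if color[nxt] == GREY:  # back edge: a cycle of references
                return True
            if color[nxt] == WHITE and visit(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color[node] == WHITE and visit(node) for node in graph)


# Projects refer to the employees working on them; employees refer to
# the projects they are accountable for.
REFERENCES = [("Project", "Employee"), ("Employee", "Project")]
```

For the two references above the check reports a cycle, whereas a single reference from `Project` to `Employee` does not form one.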

5 Related Work

Work on constraints for tree-structured data has been done during the past two decades. Our work, as regards the formalism, is closer to [31, 32, 10, 9]. The papers [10] and [9] are among the first works on defining constraints on tree-structured data. Reasoning about keys for XML is done in [10], where a single XML document is considered and keys within scope (relative keys) are introduced. Referential constraints through inclusion dependencies are also investigated (via path expression containment). The satisfiability problem is investigated, but no query language is considered. Many recent works investigate discovering conditional functional dependencies in XML data; closer to our perspective are [31] and [32], where XML schema refinement is studied through redundancy detection and normalization.

[27] and [7] focus on the JSON data model and an XPath-like navigational query language. These works also formalize the specification of unique fields and references, but they do not define relative keys. [27] formally defines a JSON schema. It supports the specification of unique fields within an object/element and supports references to another subschema (the same subschema can be used in several parts of the schema). No relative keys are supported. [7] builds on [27] and proposes a navigational query language over a single JSON document (this language captures XPath-like alternatives for JSON documents, such as JSONPath, MongoDB navigation expressions and JSONiq).

Flattening was initially studied in the context of nested relations and the hierarchical model (e.g., [29, 12, 26]). Dremel [25, 6], F1 [30] and Drill [1] use flattening to answer SQL-like queries over tree-structured data. Flattening semi-structured data is also investigated in [15, 24, 14], where the main problem is to translate semi-structured data into multiple relational tables.

6 Future work

As next steps, we plan to investigate querying tree-schemas having references to intermediate nodes and/or reference cycles. Also, we aim to study flattening when the referrer is defined in the range group of the referent. Furthermore, we plan to extend this investigation towards the following directions: a) Study the satisfiability and the implication problems for the constraints we defined here. b) The chase [28] is used to reason about keys and functional dependencies. For relational data, there is a lot of work on chase. The chase for RDF and graph data was studied in [22], [17, 20], [13] and [18]. We plan to define a new chase that can be applied to reason about the constraints we defined here.

References

  • [1] Apache Drill Project. https://drill.apache.org/.
  • [2] Apache Parquet Project. https://parquet.apache.org/.
  • [3] XPath. http://www.w3.org/TR/xpath/.
  • [4] XQuery. http://www.w3.org/TR/xquery/.
  • [5] Foto N. Afrati and Rada Chirkova. Answering Queries Using Views, Second Edition. Synthesis Lectures on Data Management. Morgan & Claypool Publishers, 2019.
  • [6] Foto N. Afrati, Dan Delorey, Mosha Pasumansky, and Jeffrey D. Ullman. Storing and querying tree-structured records in Dremel. Proc. VLDB Endow., 7(12):1131–1142, 2014.
  • [7] Pierre Bourhis, Juan L. Reutter, Fernando Suárez, and Domagoj Vrgoc. JSON: data model, query languages and schema specification. In PODS 2017, pages 123–135. ACM, 2017.
  • [8] Shannon Bradshaw, Eoin Brazil, and Kristina Chodorow. MongoDB: The Definitive Guide: Powerful and Scalable Data Storage. O’Reilly Media, 2019.
  • [9] Peter Buneman, Susan B. Davidson, Wenfei Fan, Carmem S. Hara, and Wang Chiew Tan. Keys for XML. Comput. Networks, 39(5):473–487, 2002.
  • [10] Peter Buneman, Susan B. Davidson, Wenfei Fan, Carmem S. Hara, and Wang Chiew Tan. Reasoning about keys for XML. Inf. Syst., 28(8):1037–1063, 2003.
  • [11] Diego Calvanese, Wolfgang Fischl, Reinhard Pichler, Emanuel Sallinger, and Mantas Simkus. Capturing relational schemas and functional dependencies in RDFS. In AAAI 2014, pages 1003–1011. AAAI Press, 2014.
  • [12] Latha S. Colby. A recursive algebra and query optimization for nested relations. In SIGMOD 1989, pages 273–283, 1989.
  • [13] Alvaro Cortés-Calabuig and Jan Paredaens. Semantics of constraints in RDFS. In AMW 2012, volume 866 of CEUR Workshop Proceedings, pages 75–90. CEUR-WS.org, 2012.
  • [14] Alin Deutsch, Mary F. Fernández, and Dan Suciu. Storing semistructured data with STORED. In SIGMOD 1999, pages 431–442, 1999.
  • [15] Michael DiScala and Daniel J. Abadi. Automatic generation of normalized relational schemas from nested key-value data. In SIGMOD 2016, pages 295–310, 2016.
  • [16] Wenfei Fan. XML constraints: Specification, analysis, and applications. In 16th International Workshop on Database and Expert Systems Applications (DEXA 2005), pages 805–809. IEEE Computer Society, 2005.
  • [17] Wenfei Fan, Zhe Fan, Chao Tian, and Xin Luna Dong. Keys for graphs. Proc. VLDB Endow., 8(12):1590–1601, 2015.
  • [18] Wenfei Fan and Ping Lu. Dependencies for graphs. ACM Trans. Database Syst., 44(2):5:1–5:40, 2019.
  • [19] Wenfei Fan and Jérôme Siméon. Integrity constraints for XML. J. Comput. Syst. Sci., 66(1):254–291, 2003.
  • [20] Wenfei Fan, Yinghui Wu, and Jingbo Xu. Functional dependencies for graphs. In SIGMOD Conference 2016, pages 1843–1857. ACM, 2016.
  • [21] Stefan Gössner and Stephen Frank. JSONPath. http://goessner.net/articles/JsonPath, 2007.
  • [22] Jelle Hellings, Marc Gyssens, Jan Paredaens, and Yuqing Wu. Implication and axiomatization of functional and constant constraints. Ann. Math. Artif. Intell., 76(3-4):251–279, 2016.
  • [23] Georg Lausen, Michael Meier, and Michael Schmidt. Sparqling constraints for RDF. In EDBT 2008, volume 261, pages 499–509. ACM, 2008.
  • [24] Zhen Hua Liu, Beda Christoph Hammerschmidt, and Douglas Mcmahon. JSON data management: supporting schema-less development in RDBMS. In SIGMOD 2014, pages 1247–1258, 2014.
  • [25] Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shivakumar, Matt Tolton, and Theo Vassilakis. Dremel: interactive analysis of web-scale datasets. Proc. VLDB Endow., 3(1-2):330–339, 2010.
  • [26] Jan Paredaens and Dirk Van Gucht. Possibilities and limitations of using flat operators in nested algebra expressions. In PODS 1988, pages 29–38, 1988.
  • [27] Felipe Pezoa, Juan L. Reutter, Fernando Suárez, Martín Ugarte, and Domagoj Vrgoc. Foundations of JSON schema. In WWW 2016, pages 263–273, 2016.
  • [28] Fereidoon Sadri and Jeffrey D. Ullman. The interaction between functional dependencies and template dependencies. In ACM SIGMOD Conference 1980, pages 45–51. ACM Press, 1980.
  • [29] Marc H Scholl, H-Bernhard Paul, Hans-Jörg Schek, et al. Supporting flat relations by a nested relational kernel. In VLDB, pages 137–146, 1987.
  • [30] Jeff Shute, Radek Vingralek, Bart Samwel, Ben Handy, Chad Whipkey, Eric Rollins, Mircea Oancea, Kyle Littlefield, David Menestrina, Stephan Ellner, John Cieslewicz, Ian Rae, Traian Stancescu, and Himani Apte. F1: A distributed SQL database that scales. Proc. VLDB Endow., 6(11):1068–1079, 2013.
  • [31] Loan T. H. Vo, Jinli Cao, and J. Wenny Rahayu. Discovering conditional functional dependencies in XML data. In ADC 2011, volume 115 of CRPIT, pages 143–152, 2011.
  • [32] Cong Yu and H. V. Jagadish. XML schema refinement through redundancy detection and normalization. VLDB J., 17(2):203–223, 2008.