Schema Validation and Evolution for Graph Databases

02/18/2019 ∙ by Angela Bonifati, et al.

Despite the maturity of commercial graph databases, little consensus has been reached so far on the standardization of data definition languages (DDLs) for property graphs (PG). The discussion on the characteristics of PG schemas is ongoing in many standardization and community groups. Although some basic aspects of a schema are already present in Neo4j 3.5, as in most commercial graph databases, full support for constraining property graphs with more or less flexibility is still missing. In this paper, we focus on two different perspectives from which a PG schema should be considered, as being descriptive or prescriptive, and we show how it would be possible to switch from one to the other as the application under development gains more stability. Apart from proposing a concise schema DDL inspired by Cypher syntax, we show how schema validation can be enforced through homomorphisms between PG schemas and PG instances, and how schema evolution can be described through the use of graph rewriting operations. Our prototypical implementation demonstrates feasibility and shows the need to offer high-level query primitives to accommodate flexible graph schema requirements as showcased in our work.


1 Introduction

Property graph databases are modern data management systems that use graph structures, such as nodes, edges and properties, to encode semantically complex data. Graph database technology has made tremendous progress with many commercial products—such as Neo4j, Oracle PGX, SAP HANA Graph, Redis Graph, Cypher for Apache Spark and TigerGraph—and yet little consensus has been reached so far on the standardization of graph data querying and manipulation or of data definition languages (DDLs).

The aim of ISO SC32/WG3 is to develop a new international standardized query language, called GQL (https://www.gqlstandards.org/), for property graphs, with support from the activities of the wider community such as OpenCypher (http://www.opencypher.org/) and G-Core [1]. Standardization of graph data querying and manipulation is therefore well under way. However, schema specification for graphs, along with a proposal for a graph data definition language, has only recently started to be discussed within a standardization working group of ISO SC32/WG3 as well as in community working groups. Indeed, there are only a few examples of property graph systems offering schema and DDL, e.g. Neo4j’s Cypher for Apache Spark and TigerGraph.

Neo4j 3.5 already provides the means to express certain basic aspects of schemas. Specifically, the use of unique property and property existence constraints (or, more generally, of node keys) on node and edge labels lets us require that nodes (or edges) have certain properties that, moreover, uniquely characterize that node (or edge). However, this does not allow users to express more advanced aspects of schemas such as specifying, for a given node or edge label, the collection of all possible associated properties; or constraining whether or not an edge may exist between nodes with certain labels.

The schemas that (property) graph database systems typically provide are descriptive in the sense that they only reflect the data: the schema can be changed simply by manipulating the data instance directly with no particular restrictions on any such manipulations. The flexibility that this entails is generally perceived as a valuable characteristic, particularly in the earlier stages of application development, and especially in conjunction with the now ubiquitous agile software development method. A system that allows for the structure of graph elements to be manipulated and refactored freely, as the understanding and modelling of an application’s universe evolves, greatly simplifies the development process in its early stages.

As applications mature, however, a gradual shift in priorities occurs. As a concept becomes more stable, well-established, and central in our data model, we must treat it with increasing caution when considering further modifications. The demand for restrictive schema manipulation policies further increases when an application goes into production since data becomes precious and misshaped data can have large financial consequences. By this stage, traditional prescriptive schemas are the more appropriate choice.

Fully-specified schemas are indeed required by several Neo4j use cases, including insurance and pharmaceutical customers and GDPR (https://eugdpr.org/) compliance setups. In contrast to relational settings, the schema requirements are often imposed by the evolving application rather than being encoded in the database at the very beginning of the build cycle. As such, schema evolution is not handled by a dedicated team, as in the relational setting, but rather by the application team or business units.

However, descriptive and prescriptive schemas are only the two extremes of a spectrum of practical agility needs that application teams have to deal with. Moreover, applications in productive use still continue to evolve and some schema changes are inevitable. As an application and its schema grow, some parts of the schema mature faster than others, giving rise to a different trade-off between modification flexibility and demand for restriction within a single schema. For instance, the beta-available and under-development recommendation system added to an established shopping system requires more flexibility in schema than the shopping cart and order processing component. Additionally, certain parts of an application may require greater flexibility independently of their state of maturity. For instance, the product catalogue of the shopping system requires prescriptive properties, such as article number and prices, next to the perpetually changing set of descriptive properties that its products exhibit.

Modern database systems should support all these scenarios and allow users to change their flexibility/restriction requirements. Traditional database schema approaches have been shaped by requirements originating from applications and application development methods that were state-of-the-art many decades ago. Today, the situation is different, and the number of attempts [6, 25, 10, 34, 19, 20] to work around the restrictions of traditional schema approaches encoded in many database systems provides evidence of this. As standardization of schema and DDL for PG database systems is just starting, we have a golden opportunity to consider these modern requirements on schema agility capabilities from the very beginning.

With this paper, we propose a PG schema approach that aims to accommodate modern schema requirements for property graph databases and offers support for prescriptive as well as descriptive schema in a very flexible fashion. We make the following specific contributions:

  • a schema model specifying labels and (mandatory) properties for nodes and edges with mixing composition, guaranteed to be backward compatible with the flexible use of labels in today’s PG databases while still facilitating strict typing of every graph element (Section 2);

  • a concise schema DDL with visually intuitive ASCII-art syntax inspired by Cypher (Section 2);

  • a mathematical framework for schema validation allowing us to construct both instances and schemas as property graphs and to enforce schema validation through the existence of a homomorphism from instance to schema (Section 3);

  • mathematically specified graph rewriting rules [9] and their application to update instances and/or schemas and propagate these changes from schema to instance (or vice versa) while keeping the instance and schema consistent at all times (Section 5);

  • a discussion of the requirements for graph refactoring and schema evolution based on the use of graph rewriting to express such operations mathematically (Sections 4, 6 and 8);

  • a prototypical implementation demonstrating feasibility and showing the need to offer high-level primitives for schema validation and evolution in a PG query language; the prototype builds on a Python library, called ReGraph, that allows rewriting and propagation by means of Cypher queries using Neo4j as backend (Section 7).

2 PG Schema Language

We introduce in this section an OpenCypher-based schema DDL for Property Graphs (PG). This DDL is the outcome of extensive discussions at Neo4j about the graph DDL requirements and the possible extension of OpenCypher. Although it informs and feeds the ongoing standardization process, our DDL is not intended as a standard proposal, since its main purpose is to substantiate the algorithmic contributions presented in the remainder of the paper. The basic components of a schema definition assume a finite set of labels $\mathcal{L}$, a set of property keys $\mathcal{K}$ and a finite set of data types $\mathcal{T}$.

Property graph type. A property graph type is a triple $(\mathcal{B}, \mathcal{N}, \mathcal{E})$ where $\mathcal{B}$ is a set of element types, $\mathcal{N}$ is a set of node types and $\mathcal{E}$ is a set of edge types. A property graph type provides the schema for a PG. Multiple PGs can share a property graph type to the effect that they will have the same schema.

Property type. A property type is a pair $(k, t)$, where $k$ is the property key and $t$ is its data type. For instance, “content: STRING” declares the property type $(\text{content}, \text{STRING})$.

Element type. An element type $b$ is a 4-tuple $(l, P, M, X)$, where $l$ is a label, $P$ is a set of property types, $M \subseteq P$ is a subset of mandatory property types and $X$ is the set of element types that $b$ extends.

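Hence, “Message {content: STRING?, length: INTEGER}” is a declaration of the element type $(\text{Message}, P, M, \emptyset)$, where $P = \{(\text{content}, \text{STRING}), (\text{length}, \text{INTEGER})\}$ and $M = \{(\text{length}, \text{INTEGER})\}$; while “Post :: Message {language: STRING?}” declares the element type $(\text{Post}, \{(\text{language}, \text{STRING})\}, \emptyset, \{\text{Message}\})$.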

An element type is allowed to extend multiple other element types, but must not extend itself either directly or indirectly. All element types of a property graph type must be disambiguated by their label. Where clear from context, we use the label to denote the corresponding element type.

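Exposed (mandatory) property types and labels. The set of exposed property types of an element type $b = (l, P, M, X)$ is defined as $P^*(b) = P \cup \bigcup_{x \in X} P^*(x)$, i.e. all the property types that $b$ possesses, either directly or through inheritance. Similarly, we define $M^*(b) = M \cup \bigcup_{x \in X} M^*(x)$ to be the set of exposed mandatory property types of $b$ and $L^*(b) = \{l\} \cup \bigcup_{x \in X} L^*(x)$ to be the set of exposed labels of $b$. For instance, for element type Post from above we have $P^*(\text{Post}) = \{(\text{language}, \text{STRING}), (\text{content}, \text{STRING}), (\text{length}, \text{INTEGER})\}$, $M^*(\text{Post}) = \{(\text{length}, \text{INTEGER})\}$, and $L^*(\text{Post}) = \{\text{Post}, \text{Message}\}$.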

For an element type $b$ to be valid, its set of exposed property types $P^*(b)$ must not have two or more property types with the same property key, i.e. all property types of an element type are disambiguated by their property key. Where clear from context, we will use the property key to denote the corresponding property type. For instance, for the element type Post above, we have $P^*(\text{Post}) = \{\text{language}, \text{content}, \text{length}\}$, and $M^*(\text{Post}) = \{\text{length}\}$. Note that is unambiguous for all element types of a property graph type.
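
The recursive definition of exposed property types, exposed mandatory property types and exposed labels can be sketched in a few lines of Python. The sketch is purely illustrative: the class `ElementType` and the functions `exposed` and `is_valid` are hypothetical names, not part of the DDL or of any library, and we assume that the extension relation is acyclic, as required above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ElementType:
    label: str
    props: frozenset = frozenset()      # property types as (key, data type) pairs
    mandatory: frozenset = frozenset()  # the mandatory subset of props
    extends: tuple = ()                 # labels of the element types being extended


def exposed(types, label):
    """Return the exposed (property types, mandatory property types, labels)."""
    b = types[label]
    props, mandatory, labels = set(b.props), set(b.mandatory), {b.label}
    for parent in b.extends:  # recursion terminates because extension is acyclic
        p, m, l = exposed(types, parent)
        props |= p
        mandatory |= m
        labels |= l
    return props, mandatory, labels


def is_valid(types, label):
    """Valid means: no property key occurs twice among the exposed property types."""
    keys = [key for key, _ in exposed(types, label)[0]]
    return len(keys) == len(set(keys))


# The two element types declared in the text above.
types = {
    "Message": ElementType(
        "Message",
        frozenset({("content", "STRING"), ("length", "INTEGER")}),
        frozenset({("length", "INTEGER")}),
    ),
    "Post": ElementType(
        "Post", frozenset({("language", "STRING")}), frozenset(), ("Message",)
    ),
}
props, mandatory, labels = exposed(types, "Post")
print(sorted(labels))     # ['Message', 'Post']
print(sorted(mandatory))  # [('length', 'INTEGER')]
```

Post exposes three property types in this sketch, one declared directly and two inherited from Message, and it is valid because the three keys are distinct.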

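Node type. A node type is a 1-tuple $(b)$, where $b$ is an element type. For instance, “(Post)” declares the node type $(\text{Post})$. For a node type $n = (b)$, we define its exposed property types, exposed mandatory property types and exposed labels by $P^*(n) = P^*(b)$, $M^*(n) = M^*(b)$, and $L^*(n) = L^*(b)$.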

Edge type. An edge type is a triple $(s, b, t)$, where $s$, $b$, and $t$ are element types. For instance, the edge type $(\text{Comment}, \text{REPLY\_OF}, \text{Message})$ can be declared with “(Comment)-[REPLY_OF]->(Message)”. Exposed (mandatory) property and label sets are defined analogously to node types based on $b$. Note that $s$ and $t$ need not be node types. This allows a single edge type to be inherited by multiple node types.

Example. The following snippet of the OpenCypher PG schema DDL creates a property graph type that captures an excerpt of the LDBC SNB [13] schema (the complete PG schema encoding of LDBC SNB is illustrated in the Appendix).

CREATE GRAPH TYPE snb (
  // element types
  Person {
    firstName : STRING, lastName : STRING
  },
  Message {
    creationDate : TIMESTAMP, browserUsed : STRING
  },
  Comment <: Message {},
  Post <: Message {
    imageFile : STRING?
  },
  REPLY_OF {},
  // node types
  (Person), (Post), (Comment),
  // edge types
  (Person)-[KNOWS]->(Person),
  (Person)-[LIKES]->(Message),
  (Message)-[HAS_CREATOR]->(Person),
  (Comment)-[REPLY_OF]->(Message)
)

3 Schema Validation

In this section, we provide a mathematical formalization of our notion of schema that, in particular, allows us to interpret a DDL specification as a PG. Apart from providing an intuitive visualization of the property graph type, this allows us to use the formalism for rewriting schemas in Sections 5 and 6.

Schema validation, i.e. checking that an instance graph respects the schema, can then be viewed as the existence of a homomorphism, i.e. a structure-preserving function, from the instance to the schema. We present the mathematical definitions of schemas and instances as property graphs in Section 3.1 and then discuss the application of homomorphisms to the schema validation problem in Section 3.2.

3.1 Schemas and instances as property graphs

We fix countable sets $\mathcal{O}$, $\mathcal{K}$ and $\mathcal{V}$ of objects, keys and values respectively. For the purposes of this paper, we assume that $\mathcal{V}$ contains (at least) the basic types of integers, booleans, strings and dates.

A property graph $G$ is defined to be a tuple $(N, E, \eta, P, \nu, M)$ where $N$ and $E$ are disjoint, finite subsets of $\mathcal{O}$ called nodes and edges; $\eta : E \to N \times N$ is a function assigning a source and target node to each edge; $P \subseteq (N \cup E) \times \mathcal{K}$ is a finite set of properties; $\nu \subseteq P \times \mathcal{V}$ is a finite relation, assigning sets of values to properties; and $M \subseteq P$ is a set of mandatory properties. The requirement that $\nu$ be finite means that each node and each edge has finitely many properties, each of which has a finite set of associated values.

A schema specified in our DDL from Section 2 can be interpreted as a property graph $S$ in the following way. The nodes of $S$ are the node types and we have an edge from $n$ to $n'$ in $S$ if, for some exposed labels $s \in L^*(n)$ and $t \in L^*(n')$, there is an edge type $(s, b, t)$. Note that a node type always gives rise to a single node of $S$ whereas an edge type may give rise to many edges in the schema graph; this is how inheritance in the DDL syntax is ‘expanded out’ in the schema graph interpreting the property graph type. Each node and edge has the (mandatory) properties specified by its corresponding node or edge type. As an example, the schema defined in Section 2 and interpreted as a property graph is illustrated in Figure 1.
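
To make the ‘expanding out’ of inheritance concrete, the following Python sketch derives the edges of the schema graph from the node types and edge types of the snb example. It is only an illustration under our own encoding: the dictionaries and the function names `exposed_labels` and `schema_edges` are hypothetical, and properties are left out.

```python
# Extension relation, node types and edge types of the snb example.
extends = {"Person": [], "Message": [], "Comment": ["Message"], "Post": ["Message"]}
node_types = ["Person", "Post", "Comment"]
edge_types = [
    ("Person", "KNOWS", "Person"),
    ("Person", "LIKES", "Message"),
    ("Message", "HAS_CREATOR", "Person"),
    ("Comment", "REPLY_OF", "Message"),
]


def exposed_labels(label):
    """The label itself plus every label reachable through extension."""
    labels = {label}
    for parent in extends[label]:
        labels |= exposed_labels(parent)
    return labels


def schema_edges():
    """One schema edge per pair of node types that an edge type can connect."""
    edges = set()
    for source in node_types:
        for target in node_types:
            for s, label, t in edge_types:
                if s in exposed_labels(source) and t in exposed_labels(target):
                    edges.add((source, label, target))
    return edges


for edge in sorted(schema_edges()):
    print(edge)
```

On this input the sketch yields seven edges: one KNOWS edge and two each of LIKES, HAS_CREATOR and REPLY_OF, because every edge type that mentions Message is expanded to both Post and Comment.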

[Figure not reproduced. Nodes: Person (firstName: STRING, lastName: STRING); Post (imageFile: STRING?, creationDate: STRING, browserUsed: STRING); Comment (creationDate: STRING, browserUsed: STRING). Edge labels: KNOWS (one edge), HAS_CREATOR (two edges), LIKES (two edges), REPLY_OF (two edges).]
Figure 1: An extract from the SNB schema

In this paper, we restrict our attention to simple graphs, i.e. the source-and-target function $\eta$ is injective so we do not have parallel edges. This simplifies some of the technical details but it would be a straightforward matter to extend our results and implementation to the general case.

Our definition of property graph differs in three ways from that found in [8]: we have removed node and edge labels; we have added the notion of mandatory property; and we allow the assignment of values to properties, $\nu$, to be multi-valued rather than just a single-valued partial function.

In our mathematical framework, a graph schema and a graph instance are both represented as property graphs. The designation of one as the schema and the other as an instance is determined by the fact that we can map the latter to the former in a way that respects the graph structure. This is why we have removed labels from the definition of property graph: we can now think of the labels (or label sets) of the instance $G$ as being the nodes/edges of the schema $S$.

However, this notion of schema validation only allows us to express optional properties; as such, we have added the notion of mandatory property to our notion of property graph, as discussed in Section 3.2, to be able to enforce the presence of properties in the instance $G$.

The third difference arises due to the unavoidable fact that an update, or rewrite, of a property graph may cause a property to become associated with more than one value, as discussed in Section 5, typically due to the merging of nodes.

3.2 Schema validation via graph homomorphisms

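Let $G$ and $S$ be property graphs where $N_G \cup E_G$ and $N_S \cup E_S$ are disjoint; here we write $N_G$, $E_G$, $\eta_G$, $P_G$, $\nu_G$ and $M_G$ for the components of $G$, and similarly for $S$. A homomorphism $h$ from $G$ to $S$ is a function $h_N : N_G \to N_S$ and a function $h_E : E_G \to E_S$, mapping nodes and edges of $G$ to nodes and edges of $S$, such that $\eta_S \circ h_E = (h_N \times h_N) \circ \eta_G$. We write $h : G \to S$. We further require that (i) if $(x, k) \in P_G$ then $(h(x), k) \in P_S$; (ii) if $((x, k), v) \in \nu_G$ then $((h(x), k), v) \in \nu_S$; and (iii) if $(h(x), k) \in M_S$ then $(x, k) \in M_G$.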

In words, each edge of $G$ with source and target nodes $x$ and $y$ is mapped to an edge of $S$ with source and target nodes $h(x)$ and $h(y)$. We further require that (i) all properties of $G$ are instances of properties of $S$; (ii) each property in $G$ is associated with a subset of the values associated with its corresponding property in $S$; and (iii) any instance of a mandatory property of $S$ must be mandatory in $G$.

In the case of simple graphs, we do not need to specify the second function $h_E$; it is enough to ask that, for all pairs of nodes $(x, y)$ of $G$, if there is an edge from $x$ to $y$ in $G$ then there must exist a (necessarily unique) edge from $h(x)$ to $h(y)$ in $S$.

[Figure not reproduced. The instance contains three Person nodes (Bryn Davies, Jose Alonso, Jane Murray), one node with imageFile: photo33711.jpg, creationDate: 2010-10-16, browserUsed: Firefox, and three nodes with creationDate: 2010-10-30 and browserUsed: Firefox, Safari and Safari respectively. Edge labels: LIKES (three edges), KNOWS (two edges), HAS_CREATOR (four edges), REPLY_OF (two edges).]
Figure 2: A valid instance of the SNB schema extract

We can view a homomorphism $h : G \to S$ as a formalization of the notion of schema validation, i.e. that $G$ respects the ‘schema’ $S$: each node/edge $x$ of $G$ is an instance of the schema node/edge $h(x)$; edges in $S$ constrain which edges can exist in $G$; and properties that are mandatory in the schema are mandatory (so must occur) in $G$. In the example instance of Figure 2, we have used colours to encode the homomorphism $h$, i.e. all yellow nodes are Comments, etc. In the DDL of Section 2, the fact that all element types are disambiguated by their label would also allow us to determine $h$ provided we include these labels in the instance $G$.

The ReGraph library. The Python library ReGraph (https://github.com/Kappa-Dev/ReGraph) provides a prototypical implementation of the presented system. It enables us to construct property graphs and structure them into hierarchies (DAGs) of graphs via homomorphisms. Although ReGraph can handle arbitrary hierarchies of graphs, in this paper we limit our use of the library to the special case of two graphs connected by a single homomorphism, i.e. $G \to S$. This is sufficient for our purposes since, as shown in Section 3, we can express and enforce schema validation with a hierarchy of precisely this kind.

4 Schema manipulation

In order to support schema agility, as is needed by applications today, a graph database system should provide schema manipulation operations (SMOs). The following schema modifications are the minimum a system should possess in order to provide good coverage of the modifications required in practice and theory [11, 18].

Create.

The user can add new element types, node types and edge types to the schema.

Drop.

The user can remove element types, node types and edge types from the schema.

Rename.

The user can rename the label of an element type or the property key of a property type.

Change.

The user can change element types by adding, removing or changing property types.

Partitioning/Split.

The user can split or partition a node or edge type into more fine-grained node or edge types. This is a common operation when a schema grows and gets further normalized so as to separate concepts. Partitioning can happen horizontally as well as vertically. In a (horizontal) partitioning, the user requires the distinction of Message into Post and Comment. In a (vertical) split, the user separates Message into MessageHeader and MessageBody.

Union/Join.

The user can join or union node and edge types into more coarse-grained node and edge types. This is needed when a schema shrinks or conceptual distinctions are generalized. In a (horizontal) union, the user gives up distinguishing Post and Comment to consider simply Message. In a (vertical) join, the user gives up distinguishing MessageHeader and MessageBody to consider simply Message.

Depending on the current state of maturity of a schema, users may want to perform such operations from schema to data or from data to schema. From schema to data, the traditional prescriptive schema evolution, is desirable for mature schemas in productive systems. The user specifies how the schema is to change and the system propagates these changes to the data. From data to schema, the descriptive schema manipulation, is desirable for more agile scenarios where the schema simply follows the data. In this case, the user manipulates (parts of) the data and the system propagates these changes to the schema. Additionally, users may want to restrict mature parts of the schema from being manipulated descriptively, e.g. by marking individual node and edge types as final.

This two-way propagation, from schema to data and from data to schema, is the challenging part for a system. In particular, schema manipulations such as Partitioning/Split and Union/Join imply non-trivial propagations. In the next section, we present the mathematical groundwork for such propagations.

5 Property graph rewriting

In this section, we introduce (sesqui-pushout) graph rewriting rules [9] which are our basic ingredient for performing schema evolution. In our mathematical framework, they allow us to modify property graphs. Graph rewriting for PG schema evolution has been inspired by its use in graph-based knowledge representation and update in biological networks [5, 15, 16].

We first introduce these rules and explain the semantics of the graph rewrites that they perform. Then, in Section 5.2, we focus on the application of these rules to schema graphs and instance graphs by propagating the corresponding operations from schemas to instances and vice versa.

5.1 Rewriting rules

A rewriting rule is defined by three property graphs, $L$, $P$ and $R$, and two homomorphisms $\ell : P \to L$ and $r : P \to R$. The graph $L$ is called the left-hand side (LHS) of the rule; a matching of the rule into a graph $G$ is specified by an injective homomorphism $m : L \to G$ that preserves mandatory properties: if $(x, k) \in M_L$ then $(m(x), k) \in M_G$. The graph $P$ is called the preserved region and $R$ is the right-hand side (RHS) of the rule.

The effect of rewriting $G$ through the matching $m$ can be specified abstractly due to the existence of certain operations on property graphs: a generalized set intersection, called pullback; a generalized set union, called pushout; and a generalized set difference, called pullback complement. However, we can give an equivalent but more concrete definition in terms of elementary transformations of $G$.

A rule is restrictive if $r$ is the identity function. We can ‘read off’ statically, from such a rule, a collection of elementary deletions and clones as follows:

  • a node, edge or property that occurs in $L$ but is not in the image of $\ell$ should be deleted;

  • a single node of $L$ that is the image of $n$ nodes in $P$ through $\ell$ should be cloned $n$ times.

\begin{tikzcd}
L \arrow[d, rightarrowtail, "m"'] & P \arrow[l, "\ell"'] \arrow[d, rightarrowtail, blue, "m^-"] \\
G & G^- \arrow[l, blue, "\ell^-"]
\end{tikzcd}

A restrictive rewrite of $G$ is specified by the pullback complement (in blue) of $\ell$ and $m$, where $G^-$ is the result of applying the elementary transformations of the rule to $G$. Concretely, failures of surjectivity of $\ell$ give rise to deletions while cloning arises from failures of injectivity.
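
The two elementary transformations of a restrictive rewrite, deletion and cloning, can be sketched on a simple graph as follows. This is a hand-written illustration rather than the categorical construction or the ReGraph API: graphs are encoded as a dictionary from nodes to property dictionaries plus a set of edges, and the function names `delete_node` and `clone_node` are hypothetical. We assume, following our understanding of sesqui-pushout rewriting, that a clone inherits all incident edges of the original, including loops.

```python
def delete_node(nodes, edges, x):
    """Remove node x together with its properties and incident edges."""
    kept = {n: props for n, props in nodes.items() if n != x}
    return kept, {(s, t) for s, t in edges if x not in (s, t)}


def clone_node(nodes, edges, x, clone):
    """Add a copy of x (named clone) with the same properties and incident edges."""
    new_nodes = {n: {k: set(v) for k, v in props.items()} for n, props in nodes.items()}
    new_nodes[clone] = {k: set(v) for k, v in nodes[x].items()}
    new_edges = set(edges)
    for s, t in edges:
        if s == x:
            new_edges.add((clone, t))
        if t == x:
            new_edges.add((s, clone))
        if s == x and t == x:  # a loop on x also yields a loop on the clone
            new_edges.add((clone, clone))
    return new_nodes, new_edges


# A two-node schema: Person and Message, with a loop on Message.
nodes = {"Person": {}, "Message": {"creationDate": {"STRING"}}}
edges = {("Person", "Message"), ("Message", "Person"), ("Message", "Message")}

nodes2, edges2 = clone_node(nodes, edges, "Message", "Message2")
nodes3, edges3 = delete_node(nodes2, edges2, "Message2")
print(len(edges2), len(edges3))  # 8 3
```

Cloning Message turns the three original edges into eight, and deleting the clone again recovers the original graph.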

A rule is expansive if $\ell$ is the identity function. Such a rule only performs additions and merges which we can ‘read off’ statically as follows:

  • a node, edge or property that occurs in $R$ but is not in the image of $r$ should be added;

  • nodes in $P$ that all map through $r$ to the same node of $R$ should be merged into a single node.

\begin{tikzcd}
P \arrow[d, rightarrowtail, "m^-"'] \arrow[r, "r"] & R \arrow[d, rightarrowtail, blue, "m^+"] \\
G^- \arrow[r, blue, "r^+"'] & G^+
\end{tikzcd}

An expansive rewrite of $G^-$ is specified by the pushout (in blue) of $r$ and $m^-$, where $G^+$ is the result of applying the elementary transformations of the rule to $G^-$. Concretely, failures of surjectivity in $r$ give rise to addition while failures of injectivity give rise to merging.
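
The elementary transformations of an expansive rewrite can be sketched in the same style. The Python below is again only an illustration with hypothetical function names (`add_node`, `merge_nodes`) and our own dictionary encoding; it shows in particular why merging forces properties to be multi-valued, using property values taken from the instance of Figure 2.

```python
def add_node(nodes, edges, x, props=None):
    """Add a fresh node x with the given properties (a dict from key to value set)."""
    new_nodes = dict(nodes)
    new_nodes[x] = dict(props or {})
    return new_nodes, set(edges)


def merge_nodes(nodes, edges, group, merged):
    """Merge all nodes in group into a single node, taking unions of property values."""
    def rename(n):
        return merged if n in group else n

    props = {}
    for n in group:
        for key, values in nodes[n].items():
            props.setdefault(key, set()).update(values)
    new_nodes = {n: p for n, p in nodes.items() if n not in group}
    new_nodes[merged] = props
    return new_nodes, {(rename(s), rename(t)) for s, t in edges}


nodes = {
    "m1": {"imageFile": {"photo33711.jpg"}, "creationDate": {"2010-10-16"},
           "browserUsed": {"Firefox"}},
    "m2": {"creationDate": {"2010-10-30"}, "browserUsed": {"Firefox"}},
    "jose": {"firstName": {"Jose"}},
}
edges = {("m2", "m1"), ("m1", "jose"), ("m2", "jose")}

nodes2, edges2 = merge_nodes(nodes, edges, {"m1", "m2"}, "m")
print(sorted(nodes2["m"]["creationDate"]))  # ['2010-10-16', '2010-10-30']
print(sorted(edges2))                       # [('m', 'jose'), ('m', 'm')]
```

The merged node carries two creation dates, and the edge between the two merged nodes becomes a loop.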

The overall effect of a rule is determined by performing these two phases of rewriting consecutively: first the restrictive phase; then the expansive phase.

If the graph $G$ that we wish to rewrite respects the schema $S$, i.e. we have a homomorphism $h : G \to S$, then the resulting $G^+$ will still respect $S$ provided that the rule respects $S$ in the following sense:

\begin{tikzcd}
P \arrow[d, "\ell"'] \arrow[r, "r"] & R \arrow[d, "h_R"] \\
L \arrow[r, "h_L"'] & S
\end{tikzcd}

we have homomorphisms $h_L$ and $h_R$ from $L$ and $R$ to $S$ such that $h_L \circ \ell = h_R \circ r$. In words, $L$ and $R$ respect $S$ individually and in such a way that they agree on their overlap $P$.
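
On nodes, the condition that a rule respects the schema amounts to checking that the square of maps commutes. The following Python sketch does exactly this for node maps given as dictionaries; the function name `rule_respects_schema` and the node names are hypothetical, and edges and properties are ignored for brevity.

```python
def rule_respects_schema(P_nodes, l, r, h_L, h_R):
    """Check h_L(l(p)) == h_R(r(p)) for every node p of the preserved region P."""
    return all(h_L[l[p]] == h_R[r[p]] for p in P_nodes)


# An expansive rule (l is the identity) that merges nodes a and b into m.
P_nodes = {"a", "b"}
l = {"a": "a", "b": "b"}
r = {"a": "m", "b": "m"}

# Merging two comments: typing m as Comment makes the square commute.
ok = rule_respects_schema(
    P_nodes, l, r, {"a": "Comment", "b": "Comment"}, {"m": "Comment"}
)
print(ok)  # True

# Merging a post and a comment: no choice of type for m works.
h_L = {"a": "Post", "b": "Comment"}
candidates = ("Post", "Comment", "Person")
print(any(rule_respects_schema(P_nodes, l, r, h_L, {"m": t}) for t in candidates))  # False
```

The second case is the situation in which the rewrite has to be propagated to the schema in order to restore validity.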

Example. Figure 3 illustrates a graph $P$ and a matching of $P$, which sends three of its nodes to three nodes of the instance graph $G$ (see Figure 2). The graph $R$ in Figure 4 is the RHS of an expansive rule where $r$ is determined by the colour coding. Note that $r$ is not injective, as it maps two nodes of $P$ to a single node of $R$; nor is it surjective, as $R$ contains a node to which nothing in $P$ is mapped.

Figure 4 also shows the matching of $R$ into the rewritten instance: the two nodes identified by $r$ have been merged into a single node, and the new node has been added along with its incident edges as specified in $R$.

[Figure not reproduced; the instance shown is that of Figure 2.]

Figure 3: A matching of $P$ into the instance of Figure 2

[Figure not reproduced. Compared with Figure 2, one node now carries imageFile: photo33711.jpg, creationDate: {2010-10-16, 2010-10-30}, browserUsed: Firefox, and a node with creationDate: 2018-11-26 appears both in the rule and in the rewritten instance. Edge labels: KNOWS, LIKES, HAS_CREATOR, REPLY_OF.]

Figure 4: The matching of $R$ into the rewritten instance

If the rule does not respect $S$, we lose schema validation of $G^+$ with respect to $S$. In order to restore the required homomorphism, the rewrite of $G$ must be propagated to $S$ (or vice versa).

5.2 Propagation of rewriting

We need to consider four kinds of rewrites: application of a restrictive or an expansive rule to the instance $G$ or to the schema $S$. If we rewrite $S$ to $S^+$ with an expansive rule, we immediately obtain a homomorphism $G \to S^+$ by composition and no propagation to $G$ is required. The same applies to rewriting $G$ to $G^-$ with a restrictive rule.

The remaining cases are more involved. If we rewrite $S$ to $S^-$ with a restrictive rule, a homomorphism from $G$ to $S^-$ may no longer exist since either:

  • we have deleted an element, i.e. a node or edge or property, of $S$ to which elements in $G$ were mapped; or

  • we have cloned a node in $S$ and so no longer know to which node we should map things in $G$.

However, we can determine a canonical rewrite of $G$ to some $G^-$ that restores a homomorphism $G^- \to S^-$. The rewrite of $G$ is determined by propagating the elementary transformations of the rewrite of $S$: (i) a node of $G$ that is mapped by $h$ to a deleted node of $S$ is itself deleted; and (ii) a node of $G$ that is mapped by $h$ to a cloned node of $S$ is itself cloned (the same number of times).
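
The canonical propagation of a schema clone or deletion to the instance can be sketched as follows. The Python is an illustration on simple graphs without properties; the function names `propagate_clone` and `propagate_delete`, and the convention of naming a copy by appending a prime to the original name, are our own assumptions.

```python
def propagate_clone(edges, typing, schema_node, schema_clone):
    """Clone every instance node typed by schema_node; the copies are typed by schema_clone."""
    twin = {x: x + "'" for x, t in typing.items() if t == schema_node}
    new_typing = dict(typing)
    new_typing.update({copy: schema_clone for copy in twin.values()})
    new_edges = set(edges)
    for s, t in edges:
        if s in twin:
            new_edges.add((twin[s], t))
        if t in twin:
            new_edges.add((s, twin[t]))
        if s in twin and t in twin:
            new_edges.add((twin[s], twin[t]))
    return new_edges, new_typing


def propagate_delete(edges, typing, schema_node):
    """Delete every instance node typed by the deleted schema node."""
    doomed = {x for x, t in typing.items() if t == schema_node}
    new_typing = {x: t for x, t in typing.items() if x not in doomed}
    new_edges = {(s, t) for s, t in edges if s not in doomed and t not in doomed}
    return new_edges, new_typing


typing = {"p1": "Person", "m1": "Message", "m2": "Message"}
edges = {("p1", "m1"), ("m2", "m1")}

cloned_edges, cloned_typing = propagate_clone(edges, typing, "Message", "Message2")
print(len(cloned_typing), len(cloned_edges))  # 5 6
```

Every instance of Message is duplicated here, which is rarely what a user wants in practice and motivates the controlled propagation discussed later in this section.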

Conversely, if we rewrite $G$ to $G^+$ with an expansive rule, we may no longer have a homomorphism from $G^+$ to $S$ because

  • we have added an element to $G$ that we do not know how to map to $S$; or

  • we have merged nodes in $G$ that are mapped by $h$ to different nodes of $S$.

However, we can again deduce a canonical rewrite of $S$ to some $S^+$ to restore a homomorphism $G^+ \to S^+$.

Example. The propagation of a merge to $S$ may change the type of many nodes in $G$, including those not directly affected by the rewrite. In Figures 5 and 6, the rule merges a post and a comment in $G$ but propagation to $S$ has the side-effect that all posts and comments, not just the merged node, now map to the Message node of the updated schema in Figure 7.
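
The side-effect described in this example can be reproduced with a small Python sketch of the canonical propagation of a merge from instance to schema. It tracks only the typing of instance nodes and the edges of the schema; the function name `propagate_merge` and all node names are hypothetical.

```python
def propagate_merge(typing, schema_edges, group, merged_node, merged_type):
    """Merge the instance nodes in group and, if needed, the schema nodes typing them."""
    hit = {typing[x] for x in group}
    if len(hit) == 1:  # all merged nodes share a type: the schema is unchanged
        merged_type = next(iter(hit))

    def rename(n):
        return merged_type if n in hit else n

    new_schema_edges = {(rename(s), rename(t)) for s, t in schema_edges}
    new_typing = {x: rename(t) for x, t in typing.items() if x not in group}
    new_typing[merged_node] = merged_type
    return new_typing, new_schema_edges


typing = {"bryn": "Person", "post1": "Post", "com1": "Comment", "com2": "Comment"}
schema_edges = {
    ("Person", "Person"), ("Person", "Post"), ("Person", "Comment"),
    ("Post", "Person"), ("Comment", "Person"),
    ("Comment", "Post"), ("Comment", "Comment"),
}

new_typing, new_schema_edges = propagate_merge(
    typing, schema_edges, {"post1", "com1"}, "pc", "Message"
)
print(new_typing["com2"])        # Message
print(len(new_schema_edges))     # 4
```

Although `com2` is not touched by the rule, it is retyped to Message, and the seven schema edges collapse to four.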

[Figure not reproduced; the instance shown is that of Figure 2.]

Figure 5: A matching of $P$ into the instance of Figure 2

[Figure not reproduced. Compared with Figure 2, one node now carries creationDate: 2010-10-30, browserUsed: {Firefox, Safari}. Edge labels: LIKES, KNOWS, HAS_CREATOR, REPLY_OF.]

Figure 6: A matching of $R$ into the rewritten instance

[Figure not reproduced. Nodes: Person (firstName: STRING, lastName: STRING); Message (imageFile: STRING?, creationDate: STRING, browserUsed: STRING). Edge labels: KNOWS, HAS_CREATOR, LIKES, REPLY_OF.]
Figure 7: The schema after propagation of rewriting

In some use cases, canonical propagation of rewriting to or from the schema does not produce the results we would like. For example, if we clone a node of $S$, we may not wish to clone every instance of this node in $G$; instead, we may wish to partition the existing instances in $G$ into those that are now an instance of one clone versus the others that are now an instance of the other clone.

In effect, such a controlled propagation from the schema amounts to performing the canonical propagation followed by a ‘garbage collection’ phase where all undesired clones in $G$ are deleted. This requires us to specify, in addition to the rule itself, the instances affected by the garbage collection phase. The case of canonical propagation occurs if we specify no garbage collection. Propagation to the schema can also be controlled; this amounts to specifying which newly-added nodes of $S$ should in fact be merged with pre-existing nodes. A typical use case of controlled propagation to the schema occurs if we wish to propagate a rewrite that adds two nodes to $G$, only one of which is an instance of an existing node of $S$.
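
The view of controlled propagation as canonical propagation followed by garbage collection can be sketched in Python for the typing of nodes alone. The function names `canonical_clone` and `garbage_collect` are hypothetical, copies of an instance node are represented as pairs of the original node and the schema clone typing them, and edges are omitted for brevity.

```python
def canonical_clone(typing, schema_node, clones):
    """Canonical propagation: each instance of schema_node gets one copy per clone."""
    result = {}
    for x, t in typing.items():
        for c in (clones if t == schema_node else [t]):
            result[(x, c)] = c
    return result


def garbage_collect(cloned_typing, keep):
    """Controlled propagation: drop every copy (x, c) for which keep[x] is not c."""
    return {(x, c): t for (x, c), t in cloned_typing.items() if keep.get(x, c) == c}


typing = {"m1": "Message", "m2": "Message", "p1": "Person"}

# The schema node Message is cloned into Post and Comment.
canonical = canonical_clone(typing, "Message", ["Post", "Comment"])
# The user decides that m1 is a post and m2 is a comment.
controlled = garbage_collect(canonical, {"m1": "Post", "m2": "Comment"})

print(len(canonical))      # 5
print(sorted(controlled))  # [('m1', 'Post'), ('m2', 'Comment'), ('p1', 'Person')]
```

With an empty `keep` dictionary nothing is collected and the result coincides with canonical propagation.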

ReGraph revisited. The ReGraph library enables us to express and apply rewriting rules as above. It also computes all necessary propagation of rewrites automatically so that, given a hierarchy and a rewrite of one of its graphs, it performs all necessary rewrites and reconstructs the updated hierarchy. In the case of an instance-schema hierarchy $G \to S$, this guarantees that any rewrite (of $G$ or $S$) results in a valid instance of the (updated) schema.

Controlled propagation is expressed in ReGraph by specifying a relation between the rule and the graph to which we are propagating. In the case above of a partition of the nodes of $G$, this relation would state explicitly which nodes of $G$ correspond to which nodes of the rewritten schema; we will see an example of this in the next section. An analogous relation can be used to specify controlled propagation to the schema.

6 Schema evolution

In this section, we investigate the use of ReGraph-style rewriting with propagation in our setting where the hierarchy is of the form $G \to S$. We discuss the use of propagation from and to the schema to capture rigorously the distinction between prescriptive PG schema design (and enforcement) and descriptive PG schema evolution and show how this two-way ‘to-and-from’ dialectic can be used to formalize/make explicit the process of PG schema development.

6.1 SMOs, mathematically

Let us consider the question of providing a language of SMOs from a different perspective. We have defined, in Section 2, a DDL in which we can specify schemas and, in Section 3.1, we explained how to interpret such a specification as a property graph. In Section 5, we defined graph rewriting rules that we can use to modify a schema. These rewrites can be viewed as a formalization of a class of SMOs of general interest, namely those that correspond to the schema modifications described in Section 4.

However, in order to exploit this, we need to be able to ‘read back’ the modified schema graph into DDL syntax, i.e. we must limit the kinds of mathematical rewrites we perform so that the modified PG schema is still itself the interpretation of some DDL specification. This may not always be possible because, as discussed above, the property graph interpretation of a DDL schema does not represent inheritance explicitly. As such, we could delete a property of the PG schema which, in the DDL schema, came from inheriting some element type. Assuming that other node types of the DDL schema also inherited the same element type, they should all also lose that property—but we have no way to enforce this unless we represent inheritance in our formalism.

In the next section, we explain—in the context of the use case of concept fine-graining—how to capture inheritance formally. We then show—for a restricted class of rewrites: restrictive rewrites that only clone (not delete) and expansive rewrites that only add (not merge)—that we can rewrite our PG schema and then recover the DDL schema to which it corresponds. This means that, starting from a ‘before’ DDL schema, we obtain an ‘after’ DDL schema. In itself, this does not define a concrete syntax of SMOs—whose job would be precisely to transform the ‘before’ into the ‘after’. However, it does sharply focus our attention on the requirements that the SMO syntax must fulfil.

6.2 Conceptual fine-graining

The full SNB schema [13] contains an abstract class of Messages that is inherited by the concrete classes of Posts and Comments. We might imagine that, at an earlier stage of its development, the schema contained only the Message class but that its users began to evolve an ad hoc refinement of this class by adding a new descriptive property to instances of Messages to specify whether they are intended as a ‘post’ or as a ‘comment’. In order to maintain validity of the instance, this property would have had to be explicitly added to the Message node of the schema. Eventually, this ad hoc evolution of the schema could have been codified prescriptively by cloning the Message node into Post and Comment nodes. A major update of the instance would then have been necessary in order to recover validity with respect to this finer-grained schema.

Let us replay this hypothetical scenario in our ReGraph-based framework. Our starting point is the schema of Figure 7 with the instance of Figure 2 where all blue and yellow nodes are therefore instances of Message.

Descriptive updates. We begin by defining a rewriting rule that adds a property type:post to a node and applying this rule to some message node of the instance. According to canonical propagation of rewriting, this would add the property type:post to the Message node of the schema. Subsequent applications of this rule to other message nodes would update only the instance; ReGraph would not propagate to the schema as the rule respects the updated schema. However, if we create a second rule that adds the property type:comment to a node, applying this rule to some message node of the instance would induce a second propagation to the schema because a novel value is being associated with the property: the overall effect would be to update the property in the schema to type:{post,comment}. In other words, ReGraph-style propagation automatically updates the schema as and when users add such descriptive type properties to the instance.
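The descriptive behaviour described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: graphs are plain dictionaries, properties hold sets of values, and the function `add_property` is a hypothetical helper, not part of the ReGraph API.

```python
# Hypothetical in-memory stand-ins for the instance and schema graphs;
# none of these names belong to the ReGraph API.

def add_property(instance, schema, typing, node, key, value):
    """Add key:value to an instance node and propagate to its schema type.

    Returns True if the schema had to be updated (a novel key or value).
    """
    instance.setdefault(node, {}).setdefault(key, set()).add(value)
    permitted = schema[typing[node]].setdefault(key, set())
    if value in permitted:
        return False      # the rule respects the current schema
    permitted.add(value)  # propagation to the schema
    return True


schema = {"Message": {}}
instance = {"m1": {}, "m2": {}, "m3": {}}
typing = {"m1": "Message", "m2": "Message", "m3": "Message"}

first = add_property(instance, schema, typing, "m1", "type", "post")
second = add_property(instance, schema, typing, "m2", "type", "post")
third = add_property(instance, schema, typing, "m3", "type", "comment")
```

In this sketch only the first and third calls touch the schema, which ends up with the accumulated value set for the type property on Message.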

Prescriptive updates. We continue by defining a third rule that clones a node with property type:{post,comment} and applying it to the Message node of the schema; see Figure 8. The effect of this is precisely to split the Message node into Post and Comment nodes as in Figure 1. However, if we use canonical propagation to the instance graph, to recover schema validation, this would have the (unintended) effect of duplicating every Message as both a Post and a Comment. Instead, we perform a controlled propagation which would map nodes of the instance with type:post to the node Post of the updated schema (and similarly for Comments); see Figure 9.

In other words, ReGraph-style controlled propagation updates the instance after a prescriptive update of the schema. Let us note, however, that in order to specify the controlled propagation, we first need to perform a query on the instance: in this example, we need to match all instances of Message with property type:post (and similarly for type:comment) in order to partition the instance appropriately. In particular, an instance of Message that happens not to have the type property will be cloned as both a Post and a Comment.
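The query needed to specify the controlled propagation can be illustrated as follows. This is a minimal sketch, assuming the instance is a dictionary from node identifiers to set-valued properties; the function `controlled_partition` and the node identifiers are hypothetical and do not come from ReGraph.

```python
def controlled_partition(instance, clone_types, key="type"):
    """Map each Message instance to the schema clone(s) that should type it."""
    assignment = {}
    for node, props in instance.items():
        values = props.get(key, set())
        targets = [t for v, t in clone_types.items() if v in values]
        # Without a discriminating value we fall back to the canonical
        # behaviour: the node is cloned once per schema clone.
        assignment[node] = targets or list(clone_types.values())
    return assignment


clone_types = {"post": "Post", "comment": "Comment"}
instance = {
    "m1": {"type": {"post"}},
    "m2": {},                      # no type property
    "m3": {"type": {"comment"}},
}
assignment = controlled_partition(instance, clone_types)
```

Here the untyped message is assigned to both Post and Comment, which mirrors the duplication described in the text.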

Figure 8: Prescriptive update of the schema with controlled propagation expressed by the dashed lines. (The schema node Message, with properties imageFile: STRING and type: {post, comment}, is cloned into Post, with imageFile: STRING and type: post, and Comment, with type: comment.)

Figure 9: The updated instance after controlled propagation

Overall, this example shows how the functionality provided by ReGraph to propagate changes automatically to and from the schema provides strong support for this use case—that combines prescriptive and descriptive aspects—which is a typical example of what can happen during the development of a system from an early agile state to a late mature state.

6.3 Towards SMOs

Note that the prescriptive update that clones the Message node corresponds precisely to (i) defining two new node types that, in the DDL syntax, inherit Message; and (ii) deleting the Message node type. This suggests that we can keep track of inheritance in our formalism by maintaining not only the current schema but also the rule applications that we used to construct it, i.e. an audit trail that recapitulates the schema development process. This implies that all properties added to a node must be pushed through to the end of the trail so all inheriting nodes also get the new property.

\begin{tikzcd}[row sep = tiny]
& S^- \arrow[dl] \\
S \arrow[dr] \\
& S^+
\end{tikzcd}

More precisely, suppose we start with a schema $S$ and (i) specify some inheritance/cloning, through a restrictive rewrite, leading to a refined schema $S^-$; and also (ii) specify some extension/addition of properties, through a (direct or propagated) expansive rewrite, leading to a richer schema $S^+$.

\begin{tikzcd}[row sep = tiny]
& S^- \arrow[dr] \\
& & S' \arrow[dl] \\
& S^+
\end{tikzcd}

We can push the expansive rewrite through to $S^-$ to obtain a new schema $S'$ that combines both updates, i.e. if rewrite (ii) added a property to a node that was cloned by rewrite (i), all the clones would have the new property in $S'$.

As such, $S'$ (but neither $S^-$ nor $S^+$) can be read back to a DDL schema which is a natural modification of the original DDL schema (for $S$): it adds and removes the necessary node types and adds the necessary properties to the appropriate element types. In a sense, this means that we could define this class of schema modifications by literally importing the initial DDL schema into ReGraph, rewriting it in situ, then reading back the updated DDL schema; of course, this is not a practical proposal. Nonetheless, our analysis tells us that the only SMOs needed to express the inheritance and extension operations performed by this restricted class of rewriting rules are the create, change and split operations from Section 4.

We must leave to future work the extension of our analysis to general rewriting rules and the definition of a practical concrete syntax for SMOs.

Let us conclude this section by noting that, somewhat dually to additions, deletions in an audit trail must be pushed back to the beginning, as nodes cannot pick and choose which properties they inherit from an element type. On the other hand, deleting a property in the instance graph would not propagate to the schema and would require no special treatment.

7 Implementation

In this section we would like to address the implementation issues that one must tackle in order to build the system presented above using a generic PG query engine. Although our current implementation (a prototype system proposed by ReGraph) is built on the Cypher language, we have opted to present the graph pattern matching and update operations in a generic fashion and to have operations that are agnostic to the particular choice of query language. In fact, any query language can be enhanced with the additional data manipulation capabilities that are needed in order to implement graph rewriting and propagation.

It should be noted that in this paper we do not address the performance and scalability of the proposed system, which would depend heavily on those of the underlying Cypher query engine. Our main goal is to stress the conciseness and comprehensibility of our two-way schema/instance updates, as well as the possibility of automating these updates, saving the user from thousands of lines of routine queries necessary to perform schema/instance rewriting and propagation.

7.1 Property graph rewriting

Given a property graph, a rewriting rule and an instance of that rule in the graph, we would like to obtain a graph query that performs in-place rewriting of the graph. Such a query would consist of a graph pattern matching subquery along with an update subquery which, when executed, would transform the graph into its rewritten form.

Graph pattern matching. Graph pattern matching can be performed using a typical match clause in a given graph query language. It should allow us to obtain a set of bindings from the nodes and edges of the rule's pattern to the actual nodes and edges of the graph. Note that, in our setting, the instance of a rule is always given by an injective homomorphism, which means that the nodes and edges of our patterns are always bound to distinct graph objects. Such semantics is also known as isomorphism-based semantics in the literature of graph query languages [2], where both node and edge variables must be mapped one-to-one.
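To make the injectivity requirement concrete, the following brute-force matcher enumerates injective bindings of pattern nodes to graph nodes that preserve every pattern edge. It is a sketch for small, simple directed graphs only (so injectivity on nodes implies injectivity on edges), and the function name `injective_matches` is an assumption made for illustration; a real engine would use its own match clause.

```python
from itertools import permutations


def injective_matches(pattern_nodes, pattern_edges, graph_nodes, graph_edges):
    """Yield injective node bindings that preserve every pattern edge."""
    edges = set(graph_edges)
    # permutations() never repeats a graph node, so bindings are injective
    for image in permutations(graph_nodes, len(pattern_nodes)):
        binding = dict(zip(pattern_nodes, image))
        if all((binding[s], binding[t]) in edges for s, t in pattern_edges):
            yield binding


graph_nodes = ["a", "b", "c"]
graph_edges = [("a", "b"), ("b", "c"), ("c", "c")]
matches = list(
    injective_matches(["x", "y"], [("x", "y")], graph_nodes, graph_edges)
)
```

Note that the self-loop on c is not matched by the two-node pattern, because both pattern nodes would have to be bound to the same graph node.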

Update subqueries. Consider a property graph as defined in Section 3.1. We now show how to formulate the elementary graph transformations needed for graph rewriting, and we do so by considering the aforementioned “agnostic” operations such as addition/removal of graph elements and their properties.

Some primitive graph transformations have immediate counterparts in most modern graph query languages (such as addition/removal of nodes and edges and set-up of properties). However, it is usually necessary to program clone and merge operations as discussed in the remainder of this section.

To ease our presentation we first define the notion of a property dictionary associated with a node or edge, as well as the operation of dictionary union. Every graph element has a set of associated property keys and, for every such key, a set of property values. The property dictionary of an element is then the set of pairs formed by each of its keys together with the corresponding set of values. As such, any dictionary associates keys with sets of values.

The operation of dictionary union for any two dictionaries can now be defined as the union of their keys and of the value sets corresponding to these keys: if a particular key is present in one dictionary but not in the other, the set of values corresponding to that key in the union is simply its value set from the dictionary that contains it.
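As a small illustration of dictionary union, the following Python sketch represents a property dictionary as a mapping from keys to sets of values. The function name `dict_union` and the sample data are assumptions chosen for the example.

```python
def dict_union(d1, d2):
    """Union of two property dictionaries mapping keys to sets of values."""
    return {
        key: set(d1.get(key, ())) | set(d2.get(key, ()))
        for key in d1.keys() | d2.keys()
    }


a = {"browserUsed": {"Firefox"}, "type": {"post"}}
b = {"browserUsed": {"Safari"}}
u = dict_union(a, b)
```

A key shared by both dictionaries receives the union of its two value sets, while a key present in only one of them keeps its value set unchanged.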

Clone. The cloning operation for a particular node of interest creates a new node and copies to it all the properties of the original one. All the incident edges, together with their properties, are also copied to the newly created node. Query 1 illustrates such a cloning operation expressed as a language-agnostic pseudo-query; while Appendix C contains a real Cypher 9 query generated by the prototype system implemented in ReGraph.
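The cloning operation can also be sketched directly in Python on an in-memory simple graph. The representation (a dictionary of node properties and a dictionary of edge properties keyed by source/target pairs) and the function `clone_node` are assumptions for illustration, and self-loops are deliberately left out of this sketch.

```python
import copy


def clone_node(nodes, edges, n, new_id):
    """Clone node n as new_id, copying its properties and incident edges."""
    nodes[new_id] = copy.deepcopy(nodes[n])
    for (s, t), props in list(edges.items()):
        if s == n and t == n:
            continue  # self-loops are not handled in this sketch
        if s == n:
            edges[(new_id, t)] = copy.deepcopy(props)   # outgoing edge
        elif t == n:
            edges[(s, new_id)] = copy.deepcopy(props)   # incoming edge
    return new_id


nodes = {"m": {"type": {"post"}}, "p": {}, "q": {}}
edges = {
    ("m", "p"): {"label": {"HAS_CREATOR"}},
    ("q", "m"): {"label": {"LIKES"}},
}
clone_node(nodes, edges, "m", "m_clone")
```

Deep copies are used so that later updates to the clone do not silently modify the properties of the original node.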

Merge. The merging operation for a given set of nodes can be implemented in two different ways. First, we can create a new node (that we will use as the result of merge), set its property dictionary to be the union of the dictionaries of all the nodes we merge, reconnect all the neighbours and finally remove all the nodes we are merging. The second version of merge can be implemented by picking an arbitrary node from the set of nodes that need to be merged and using it, instead of a new node, to perform the above operations. In our implementation, we chose the second version as it lets us save some extra operations of property and edge addition.

There is, however, a subtle issue to be discussed concerning the simplicity of graphs in which we want to perform a merging operation. Its behaviour, when reconnecting edges, differs slightly for simple and non-simple graphs. This is due to the fact that if two nodes to be merged have edges from/to the same neighbour in a simple graph, these edges should be merged into a single edge whose properties are given by the union of the original property dictionaries. Therefore, a merge operation in simple graphs has an overhead compared with non-simple graphs.
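The extra work required in simple graphs can be seen in the following sketch, which merges one node into another and collapses any resulting parallel edges by taking the union of their property dictionaries. As before, the dictionary-based graph representation and the function `merge_nodes` are illustrative assumptions rather than the prototype's actual code.

```python
def merge_nodes(nodes, edges, keep, drop):
    """Merge node `drop` into node `keep` in a simple graph, in place."""
    # Union of the two property dictionaries, stored on the kept node
    for key, values in nodes.pop(drop).items():
        nodes[keep].setdefault(key, set()).update(values)
    # Reconnect the edges of the dropped node
    for (s, t), props in list(edges.items()):
        if drop not in (s, t):
            continue
        del edges[(s, t)]
        new_edge = (keep if s == drop else s, keep if t == drop else t)
        # In a simple graph an existing edge absorbs the reconnected one
        target = edges.setdefault(new_edge, {})
        for key, values in props.items():
            target.setdefault(key, set()).update(values)


nodes = {
    "a": {"browserUsed": {"Firefox"}},
    "b": {"browserUsed": {"Safari"}},
    "p": {},
}
edges = {("a", "p"): {"w": {1}}, ("b", "p"): {"w": {2}}}
merge_nodes(nodes, edges, keep="a", drop="b")
```

In a non-simple graph the reconnected edge would simply be added alongside the existing one, so the final property union on edges would not be needed.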

The pseudo-query for non-simple graphs is illustrated in Query 2, while the query for simple ones can be found in Appendix B. Although we focus on simple graphs in this paper, we report both versions in order to show the extra operations required when switching from non-simple graphs to simple graphs. Appendix D contains a real Cypher 9 query for merging in simple graphs generated by our prototype.

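Data: a property graph, a node n (node to clone)
create node n′
set properties of n′ to the properties of n
for each outgoing edge (n, m) of n do
       create edge (n′, m)
       set properties of (n′, m) to those of (n, m)
end for
for each incoming edge (m, n) of n do
       create edge (m, n′)
       set properties of (m, n′) to those of (m, n)
end for
Query 1 Clone of a node
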
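Data: a property graph, nodes n1 and n2 (nodes to merge)
set properties of n1 to the union of the property dictionaries of n1 and n2
for each outgoing edge (n2, m) of n2 do
       create edge (n1, m)
       set properties of (n1, m) to those of (n2, m)
end for
for each incoming edge (m, n2) of n2 do
       create edge (m, n1)
       set properties of (m, n1) to those of (m, n2)
end for
detach and delete n2
Query 2 Merge of two nodes in non-simple graphs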

In what follows, for the sake of conciseness, we refer to clone and merge operations that clone one node into several nodes and merge several nodes into one, i.e. generalizing clone and merge to work on an arbitrary number of nodes. They also represent the primitive operations whose incorporation into a graph query language would significantly facilitate the support for graph rewriting and propagation.

7.2 Propagation of rewriting

Consider an instance graph typed by a schema graph through a homomorphism. In this section, we focus on the case where rewriting of the instance or of the schema requires propagation of changes (as described in Section 5) and, in particular, how to express this propagation with elementary graph operations that can be translated into typical graph query language clauses. Here we formulate updates of both schema and instance graphs using generic graph update operations as in the previous subsection. Note that, for simplicity of exposition, we assume that both the instance and the schema are simple graphs.

According to our approach, for both rewriting cases (rewriting of the schema with a restrictive rule and rewriting of the instance with an expansive rule) we first perform the rewriting, which invalidates the typing homomorphism, then we compute a set of transformations that “repair” the typing by producing a new graph homomorphism. We discuss both cases in the following.

Propagation to instance. Recall that a restrictive rewriting of the schema $S$ to $S^-$ gives us a homomorphism from $S^-$ to $S$ (as in Section 5). Given this homomorphism and the original typing of the instance, we can construct a set of pairs that relates nodes of the instance to nodes of $S^-$. This set of pairs can be used to infer the “repair” transformations necessary for rewriting the instance and restoring its typing. For every node of the instance graph we define its typing relation with nodes from $S^-$ as the set of nodes of $S^-$ with which it is paired. Now we can use the following pseudo-query to perform the necessary clones and deletes in the instance to produce a typing of its nodes which, due to the fact that our graphs are simple, uniquely defines a homomorphism from the rewritten instance to $S^-$.

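for each instance node whose typing relation is empty do
       detach delete the node
end for
for each instance node whose typing relation has more than one element do
       pick an element of the typing relation
       set it as the type of the node
       for each remaining element of the typing relation do
             clone the node as a new node
             set that element as the type of the new node
       end for
end for
for each instance edge whose endpoints have types that are not connected by an edge in the rewritten schema do
       delete edge
end for
Query 3 Propagation to the instance graph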
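The repair steps of propagation to the instance can be sketched in Python as follows. This is an illustrative reading of canonical (uncontrolled) propagation only, under the assumptions that graphs are simple, that self-loops are absent, and that the schema rewrite is summarized by a hypothetical `preimage` map from each old schema node to the list of nodes that replace it; none of these names come from ReGraph.

```python
def propagate_to_instance(nodes, edges, typing, preimage, new_schema_edges):
    """Repair an instance and its typing after a restrictive schema rewrite."""
    new_typing = {}
    for n in list(nodes):
        candidates = preimage.get(typing[n], [])
        if not candidates:                      # the type was deleted
            del nodes[n]
            for e in [e for e in edges if n in e]:
                del edges[e]                    # detach delete
            continue
        new_typing[n] = candidates[0]
        for i, t in enumerate(candidates[1:], start=1):  # the type was cloned
            clone = f"{n}_{i}"
            nodes[clone] = {k: set(v) for k, v in nodes[n].items()}
            for (s, d), props in list(edges.items()):
                if s == n:
                    edges[(clone, d)] = dict(props)
                if d == n:
                    edges[(s, clone)] = dict(props)
            new_typing[clone] = t
    # Remove edges that the rewritten schema no longer permits
    for (s, d) in list(edges):
        if (new_typing[s], new_typing[d]) not in new_schema_edges:
            del edges[(s, d)]
    return new_typing


nodes = {"m": {}, "p": {}}
edges = {("m", "p"): {}}
typing = {"m": "Message", "p": "Person"}
preimage = {"Message": ["Post", "Comment"], "Person": ["Person"]}
new_schema_edges = {("Post", "Person")}
new_typing = propagate_to_instance(nodes, edges, typing, preimage, new_schema_edges)
```

In the example, the message is cloned once per schema clone, and the copied edge from the Comment clone is then removed because the (hypothetical) rewritten schema permits only edges from Post to Person.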

Propagation to schema. We now consider the case where an expansive rewrite of the instance graph induces some changes to the schema. Such a rewriting gives us a homomorphism from the original instance to the rewritten instance (as in Section 5). Given this homomorphism and the original typing, we can construct a set of pairs that relates nodes of the rewritten instance to nodes of the schema $S$. In this case, the set of pairs can be used to infer the necessary rewriting of the schema graph $S$ to $S^+$ and to restore the typing. For every node of the rewritten instance graph, we define its typing relation with nodes from $S$ as the set of schema nodes with which it is paired. Query 4 performs the node additions and merges, followed by the necessary edge additions, that are required to construct $S^+$ and the new typing:

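for each node of the rewritten instance whose typing relation is empty do
       create a node in the schema
       set it as the type of the instance node
end for
for each node of the rewritten instance whose typing relation has more than one element do
       merge these schema nodes as a single node
       set it as the type of the instance node
end for
for each edge of the rewritten instance whose endpoints have types that are not connected by an edge in the schema do
       create the corresponding edge in the schema
end for
Query 4 Propagation to the schema graph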
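A corresponding sketch for propagation to the schema is given below. It again covers only the canonical case and assumes simple graphs without properties; the argument `typing_rel` (mapping each node of the rewritten instance to the set of schema nodes that type it) and the naming scheme for fresh schema nodes are hypothetical choices made for the example.

```python
def propagate_to_schema(schema_nodes, schema_edges, typing_rel, instance_edges):
    """Repair the schema and the typing after an expansive instance rewrite."""
    alias = {}  # merged schema node -> surviving schema node

    def resolve(t):
        while t in alias:
            t = alias[t]
        return t

    typing = {}
    for n, types in typing_rel.items():
        types = sorted({resolve(t) for t in types})
        if not types:                  # added node: create a fresh type
            fresh = f"type_of_{n}"
            schema_nodes.add(fresh)
            typing[n] = fresh
        else:
            keep = types[0]
            for t in types[1:]:        # merged node: merge its types
                alias[t] = keep
                schema_nodes.discard(t)
            typing[n] = keep
    # Redirect existing schema edges to the surviving nodes
    redirected = {(resolve(s), resolve(t)) for s, t in schema_edges}
    schema_edges.clear()
    schema_edges.update(redirected)
    typing = {n: resolve(t) for n, t in typing.items()}
    # Add the schema edges needed to type every instance edge
    for s, t in instance_edges:
        schema_edges.add((typing[s], typing[t]))
    return typing


schema_nodes = {"Person", "Message"}
schema_edges = {("Message", "Person")}
typing_rel = {"p": {"Person"}, "m": {"Message"}, "t": set()}
instance_edges = [("m", "p"), ("m", "t")]
typing = propagate_to_schema(schema_nodes, schema_edges, typing_rel, instance_edges)
```

In the example, the newly added instance node t has no type, so a fresh schema node is created for it together with the schema edge required by the new instance edge.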

8 Discussion

In this section, we discuss additional details of the implementation of our mathematical framework and its suitability for understanding the requirements of PG schema validation and evolution.

The Python library ReGraph on which we build was originally based on in-memory NetworkX (https://networkx.github.io) and later extended to work directly with the Neo4j graph database. This necessitates a certain amount of encoding in order to (i) maintain the multiple graphs that constitute a hierarchy within the single graph (and so namespace) currently provided by Neo4j; and (ii) represent the homomorphisms of the hierarchy as edges with a ‘reserved’ label that encodes the typing. Moreover, because we consider properties that have sets of values, we must encode this using Neo4j lists. However, we anticipate that future versions of Neo4j will reduce this overhead of encoding, notably by providing native support for multiple graphs.

The encoding into Neo4j allows us to represent rewriting rules as Cypher queries that are computed automatically. Although the current version of Cypher supports basic update operations (precisely, in our development we focus on the following Cypher operations: create to create nodes and edges, merge to match and create edges, set to set the properties of nodes and edges, delete to delete nodes and edges, and detach delete to force removal of nodes with incident edges), the lack of native support for clone and merge operations leads to a significant blow-up in the size of the query (as reported in Appendix C and Appendix D), compared to the rule itself, notably due to the requirement that the homomorphic mapping from the instance to the schema must be maintained at all times. A further minor issue arises, in our encoding of merge operations, where we are obliged to make one limited use of the apoc library because we cannot represent a key as a variable. Presumably, one could envision the addition of clone and merge operations, in the style we showed in this paper, to future versions of graph query languages (including GQL, OpenCypher and G-CORE [14, 1]).

Finally, let us note that our current interpretation of the semantics of Cypher 9’s update operations used in our implementation is based on their practical usage due to the lack of formal semantics for these operations. The formalization of the update fragment of Cypher is actually ongoing (Leonid Libkin, private communication) and will soon lead to a formal interpretation, similar to that realized for its read-only fragment [14].

An in-depth study of the computational complexity of the schema propagation operations based on graph rewriting rules as presented here falls beyond the scope of our paper. Deciding well-typedness of a graph pattern with no ISA edges under a schema graph with no implicit object class nodes in the GDM model (roughly corresponding to our DDL) is in PTIME [3], thus leading us to the conjecture that our schema validation is in PTIME as well. However, the precise complexity of schema evolution under our DDL, thus entailing revalidation of entire graph patterns after schema modifications or after instance modifications is unknown and left as one of the open theoretical questions of our work.

Conceptually, in ReGraph we represent a schema as a property graph that contains no real data but only constraints on the data that is permitted. In our development, we provide two possible ways to build and maintain the schema graph: one uses symbolic types to constrain the values that a property can take; the other accumulates its set of permitted values. The first choice more closely matches the PG DDL specification, and the usual intuition, of schemas whereas the second has a non-standard flavour of mixing actual data with constraints. Nonetheless, the second option may be of interest, at least in earlier descriptive phases of schema development, as it can exploit propagation from an instance to the schema to accumulate sets of permitted values automatically. At some point, more generic constraints on data should start to become apparent and one can switch to the more traditional mode of prescriptive schema development using symbolic types.

Our current implementation is external to Neo4j in the sense that schema validation and evolution only make sense through the lens of ReGraph. Although this implies a certain overhead, as discussed above, it nonetheless provides a very useful test-bed for a thorough conceptual and technical debugging of the requirements on a modern, flexible system for PG schema validation and evolution—prior to the significant technical effort that would be required to internalize this into native support for schemas in Neo4j.

Moreover, although we have focused exclusively on schemas and instances in this paper, the external framework provided by ReGraph enables an entirely different mode of use of Neo4j (or other graph databases) which decouples the user from the concrete data model and allows them to define their own domain-specific knowledge representation system. Apart from that, one can work on intermediate representations in between concrete graph instances and schemas, such as updatable graph views or graph summaries.

9 Related Work

Schema evolution [27] is a well established topic in data management. A set of principles governing schema and instance evolution under schema constraints was discussed in Hartung et al. [17]. There are various approaches to increase comfort and efficiency, e.g. by defining a schema evolution aware query language [28] or by providing a general framework to describe database evolution in the context of evolving applications [12].

The Meta Model Management 2.0 [7] of Bernstein et al. introduced a comprehensive tooling to match, merge and diff given relational schema versions. The resulting mappings couple the evolution of both the schema and the data; however, these mappings are complex relationships between heterogeneous schemas, as in data integration and ETL scenarios, i.e. they only deal with schema evolution after the fact.

Currently, PRISM [11, 25, 10] and InVerDa [20, 19] seem to provide the most advanced database schema evolution tools. PRISM focuses on plain database evolution but also allows the answering of queries using former schema versions with respect to the current data. InVerDa provides fully co-existing schema versions via bidirectional transformations [32] with symmetric relational lenses [21].

Another interesting category of tools targets the co-evolution of different artefacts in an information system, e.g. MoDEF [31] introduces an IDE extension to automate the co-evolution of the evolving client schemas and the store. However, none of these approaches steps away from the underlying assumption of a prescriptive schema.

Apart from relational databases, schema evolution is a hot topic for XML databases and ontology management systems, as surveyed in [17]. The major data vendors, such as Oracle, Microsoft SQL Server, and IBM DB2, offer support for XML Schema. Other research efforts, such as X-Evolution [24] and XEM [30], addressed the problem of incremental XML validation, where incremental means that empty or default XML elements are often inserted to fill the gaps where an XML document no longer validates. Due to the tree-shaped nature of XML data, these approaches are quite different from ours and are still focused on prescriptive schemas, with the exception of XML schema of type xsd:anyType which can encode unconstrained XML content.

SHACL [33] is a language for validating RDF graphs against a set of conditions. These conditions are provided as shapes under the form of an RDF graph. Shapes are used to validate RDF instances against a set of conditions and they can also be viewed as descriptors of the data that do satisfy these conditions. SHACL supports RDF terms restrictions (e.g. value restrictions, allowed values, datatypes comparison etc), cardinality constraints, and predicate constraints (e.g. required predicates, class-specific property range etc).

Ontologies are conceptually more abstract models than database schemas and range from controlled vocabularies and thesauri over is-a hierarchies/taxonomies and directed acyclic graphs [17]. Instances have different roles in ontology management systems and typically lie in completely separate data sources. Moreover, ontologies are usually representatives of a specific domain and are the final outcome of collaborative editing from one or more domain experts.

Research on ontologies also considered the problem of update propagation to instances using Description Logic mappings [22, 36, 35]. However, the ontology formalisms are confined to interpretation from a restricted set of experts as opposed to DDL-like languages in RDBMS and in particular to the DDL proposed in this paper for graph databases. Description Logic mappings are also quite complex when contrasted with the implicit homomorphisms considered in our work.

The distinction between descriptive and prescriptive schemas as carried out in our paper is reminiscent of open and closed tuple types as used for instance in JSON [26]. An open type allows a tuple to contain additional attributes beyond those appearing in the schema declaration, whereas a closed type would not allow it. However, the schema flexibility pointed out in our work affects not only types but entire portions of the schemas and as such is more general.

Graph rewriting has been used in a variety of areas related to knowledge representation and meta-modelling. For example, triple graph grammars [29, 23, 4]—which correspond very closely to our rewriting rules—provide a means to specify bidirectional model transformations and have been used in various applications such as conformance testing and model synchronization. Another example is the KAMI bio-curation tool [16], which represents (i) knowledge about protein–protein interactions as graphs; and (ii) updates of knowledge as rewriting rules that propagate to a model-specific schema in descriptive fashion; and moreover (iii) provides a fixed, prescriptive meta-model that constrains the entire system. As such, we see—albeit in a three-level rather than two-level system—an example of the co-existence of more or less mature aspects of schemas within a single application which enables the tool to remain responsive to novel knowledge—provided that that knowledge at least fits within its view of the universe, as defined by its meta-model.

10 Concluding Remarks

We have presented a schema DDL for property graphs following the ASCII-art syntax inspired by Cypher. We have shown how schema validation and schema evolution for graphs can be simulated via a mathematical framework that allows us to enforce schemas and to express propagation from schema to instance and vice versa. We have discussed how to achieve modern schema requirements for property graph databases by offering support for both prescriptive and descriptive schemas. We have discussed an implementation in a pseudo-query language, which is agnostic to concrete graph query language syntax, and provided some details and discussion of our specific encoding in Cypher 9.

We believe that our work can be extended in at least two possible directions. The first direction would add a third layer to the graph hierarchy and study how to apply modifications to a hybrid, or summary, graph that lies between the instance and the schema, i.e. in a three-level hierarchy, and which would play the role of an updatable graph view. Secondly, the preliminary discussion in this paper concerning a schema manipulation language would require the study and definition of concrete syntax proposals for such a language. We hope that this work already provides insights towards that goal and triggers a discussion on these languages, which are much needed for application development, from prototyping to production.

References

  • [1] R. Angles, M. Arenas, P. Barceló, P. A. Boncz, G. H. L. Fletcher, C. Gutierrez, T. Lindaaker, M. Paradies, S. Plantikow, J. F. Sequeda, O. van Rest, and H. Voigt. G-CORE: A core for future graph query languages. In SIGMOD, pages 1421–1432, 2018.
  • [2] R. Angles, M. Arenas, P. Barceló, A. Hogan, J. L. Reutter, and D. Vrgoc. Foundations of modern query languages for graph databases. ACM Comput. Surv., 50(5):68:1–68:40, 2017.
  • [3] R. Angles and C. Gutiérrez. Survey of graph database models. ACM Comput. Surv., 40(1):1:1–1:39, 2008.
  • [4] A. Anjorin, E. Leblebici, and A. Schürr. 20 Years of Triple Graph Grammars: A Roadmap for Future Research. Electronic Communications of the EASST, 73, 2015.
  • [5] A. Basso-Blandin, W. Fontana, and R. Harmer. A knowledge representation meta-model for rule-based modelling of signalling networks. EPTCS, 204:47–59, 2016.
  • [6] J. L. Beckmann, A. Halverson, R. Krishnamurthy, and J. F. Naughton. Extending RDBMSs To Support Sparse Datasets Using An Interpreted Attribute Storage Format. In ICDE, pages 58–67, 2006.
  • [7] P. A. Bernstein and S. Melnik. Model Management 2.0: Manipulating Richer Mappings. In SIGMOD, pages 1–12, 2007.
  • [8] A. Bonifati, G. Fletcher, H. Voigt, and N. Yakovets. Querying Graphs. Synthesis Lectures on Data Management. Morgan & Claypool Publishers, 2018.
  • [9] A. Corradini, T. Heindel, F. Hermann, and B. König. Sesqui-pushout rewriting. In International Conference on Graph Transformation, pages 30–45. Springer, 2006.
  • [10] C. Curino, H. J. Moon, A. Deutsch, and C. Zaniolo. Automating the database schema evolution process. The VLDB Journal – The International Journal on Very Large Data Bases, 22(1):73–98, Feb. 2013.
  • [11] C. Curino, H. J. Moon, and C. Zaniolo. Graceful Database Schema Evolution: the PRISM Workbench. The Proceedings of the VLDB Endowment, 1(1):761–772, Aug. 2008.
  • [12] E. Domínguez, J. Lloret, A. L. Rubio, and M. A. Zapata. MeDEA: A database evolution architecture with traceability. Data and Knowledge Engineering, 65(3):419–441, June 2008.
  • [13] O. Erling, A. Averbuch, J. Larriba-Pey, H. Chafi, A. Gubichev, A. Prat-Pérez, M. Pham, and P. A. Boncz. The LDBC Social Network Benchmark: Interactive Workload. In SIGMOD, pages 619–630, 2015.
  • [14] N. Francis, A. Green, P. Guagliardo, L. Libkin, T. Lindaaker, V. Marsault, S. Plantikow, M. Rydberg, P. Selmer, and A. Taylor. Cypher: An evolving query language for property graphs. In SIGMOD, pages 1433–1445, 2018.
  • [15] R. Harmer. Rule-based meta-modelling for bio-curation. Habilitation à Diriger des Recherches, ENS Lyon, France, 2017.
  • [16] R. Harmer, Y.-S. L. Cornec, S. Légaré, and I. Oshurko. Bio-curation for cellular signalling: The kami project. In Computational Methods in Systems Biology, pages 3–19. Springer International Publishing, 2017.
  • [17] M. Hartung, J. F. Terwilliger, and E. Rahm. Recent advances in schema and ontology evolution. In Schema Matching and Mapping, pages 149–190. 2011.
  • [18] K. Herrmann, H. Voigt, A. Behrend, and W. Lehner. CoDEL - A Relationally Complete Language for Database Evolution. In ADBIS, volume 9282, pages 63–76, 2015.
  • [19] K. Herrmann, H. Voigt, A. Behrend, J. Rausch, and W. Lehner. Living in Parallel Realities: Co-Existing Schema Versions with a Bidirectional Database Evolution Language. In SIGMOD, pages 1101–1116, 2017.
  • [20] K. Herrmann, H. Voigt, T. B. Pedersen, and W. Lehner. Multi-schema-version data management: data independence in the twenty-first century. The VLDB Journal – The International Journal on Very Large Data Bases, 27(4):547–571, Aug. 2018.
  • [21] M. Hofmann, B. C. Pierce, and D. Wagner. Symmetric Lenses. In POPL, pages 371–384, 2011.
  • [22] E. Kharlamov, D. Zheleznyakov, and D. Calvanese. Capturing model-based ontology evolution at the instance level: The case of dl-lite. J. Comput. Syst. Sci., 79(6):835–872, 2013.
  • [23] A. Königs and A. Schürr. Tool Integration with Triple Graph Grammars - A Survey. Electronic Notes in Theoretical Computer Science, 148(1):113–150, 2006.
  • [24] M. Mesiti, R. Celle, M. A. Sorrenti, and G. Guerrini. X-evolution: A system for XML schema evolution and document adaptation. In EDBT, pages 1143–1146, 2006.
  • [25] H. J. Moon, C. Curino, M. Ham, and C. Zaniolo. PRIMA: Archiving and Querying Historical Data with Evolving Schemas. In U. Çetintemel, S. B. Zdonik, D. Kossmann, and N. Tatbul, editors, SIGMOD, pages 1019–1022, 2009.
  • [26] K. W. Ong, Y. Papakonstantinou, and R. Vernoux. The SQL++ semi-structured data model and query language: A capabilities survey of sql-on-hadoop, nosql and newsql databases. CoRR, abs/1405.3631, 2014.
  • [27] E. Rahm and P. A. Bernstein. An Online Bibliography on Schema Evolution. SIGMOD Record, 35(4):30–31, Dec. 2006.
  • [28] J. F. Roddick. SQL/SE - A Query Language Extension for Databases Supporting Schema Evolution. SIGMOD Record, 21(3):10–16, Sept. 1992.
  • [29] A. Schürr. Specification of Graph Translators with Triple Graph Grammars. In E. W. Mayr, G. Schmidt, and G. Tinhofer, editors, Workshop on Graph-Theoretic Concepts in Computer Science, pages 151–163, 1994.
  • [30] H. Su, D. Kramer, L. Chen, K. T. Claypool, and E. A. Rundensteiner. XEM: managing the evolution of XML documents. In RIDE-DM, pages 103–110, 2001.
  • [31] J. F. Terwilliger, P. A. Bernstein, and A. Unnithan. Worry-Free Database Upgrades: Automated Model-Driven Evolution of Schemas and Complex Mappings. In SIGMOD, pages 1191–1194, 2010.
  • [32] J. F. Terwilliger, A. Cleve, and C. Curino. How Clean Is Your Sandbox? - Towards a Unified Theoretical Framework for Incremental Bidirectional Transformations. In Z. Hu and J. de Lara, editors, ICMT, pages 1–23, 2012.
  • [33] W3C. Shapes Constraint Language (SHACL). https://www.w3.org/TR/shacl/.
  • [34] H. Voigt and W. Lehner. Flexible Relational Data Model - A Common Ground for Schema-Flexible Database Systems. In Y. Manolopoulos, G. Trajcevski, and M. Kon-Popovska, editors, ADBIS, volume 8716, pages 25–38, 2014.
  • [35] Z. Wang, K. Wang, Z. Zhuang, and G. Qi. Instance-driven ontology evolution in dl-lite. In AAAI, pages 1656–1662, 2015.
  • [36] J. Wu and F. Lécué. Towards consistency checking over evolving ontologies. In CIKM, pages 909–918. ACM, 2014.

Appendix

Appendix A LDBC SNB schema

The entire DDL schema encoding of the LDBC SNB benchmark is reported below:

CREATE GRAPH TYPE snb (
  Person {
    creationDate  : TIMESTAMP,
    firstName     : STRING,
    lastName      : STRING,
    gender        : STRING,
    birthday      : DATE,
    email         : STRING,
    speaks        : STRING,
    browserUsed   : STRING,
    locationIP    : STRING
  },
  Organisation {
    name : STRING, url : STRING
  },
  Company <: Organisation {},
  University <: Organisation {},
  Message {
    creationDate  : TIMESTAMP,
    browserUsed   : STRING,
    locationIP    : STRING,
    content       : STRING?,
    length        : INTEGER
  },
  Comment <: Message {},
  Post <: Message {
    language : STRING?, imageFile : STRING?
  },
  Forum {
    title : STRING, creationDate : TIMESTAMP
  },
  TagClass {
    name : STRING, url : STRING
  },
  Place {
    name : STRING, url : STRING
  },
  City <: Place {},
  Continent <: Place {},
  Country <: Place {},
  Tag {
    name : STRING, url : STRING
  },
  HAS_TYPE {},
  HAS_TAG {},
  IS_SUBCLASS_OF {},
  HAS_MODERATOR {},
  HAS_CREATOR {},
  REPLY_OF {},
  HAS_INTEREST {},
  CONTAINER_OF {},
  IS_PART_OF {},
  IS_LOCATED_IN {},
  KNOWS {
    creationDate : TIMESTAMP
  },
  HAS_MEMBER {
    joinDate : TIMESTAMP
  },
  WORK_AT {
    workFrom : INTEGER
  },
  STUDY_AT {
    classYear : INTEGER
  },
  LIKES {
    creationDate : TIMESTAMP
  },
  (Post),
  (Comment),
  (Continent),
  (Country),
  (City),
  (University),
  (Company),
  (Tag),
  (Person),
  (Forum),
  (TagClass),
  (Country)-[IS_PART_OF]->(Continent),
  (Forum)-[HAS_TAG]->(Tag),
  (Person)-[IS_LOCATED_IN]->(City),
  (Comment)-[REPLY_OF]->(Message),
  (University)-[IS_LOCATED_IN]->(City),
  (Person)-[HAS_INTEREST]->(Tag),
  (TagClass)-[IS_SUBCLASS_OF]->(TagClass),
  (City)-[IS_PART_OF]-><1>(Country),
  (Person)-[WORK_AT]->(Company),
  (Forum)-[HAS_MODERATOR]->(Person),
  (Forum)-[HAS_MEMBER]->(Person),
  (Message)-[HAS_CREATOR]->(Person),
  (Tag)-[HAS_TYPE]->(TagClass),
  (Company)-[IS_LOCATED_IN]->(Country),
  (Message)-[HAS_TAG]->(Tag),
  (Message)-[IS_LOCATED_IN]->(Country),
  (Person)-[STUDY_AT]->(University),
  (Person)-[LIKES]->(Message),
  (Forum)-[CONTAINER_OF]->(Post),
  (Person)-[KNOWS]->(Person)
)
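
To make the role of the edge declarations above more concrete, the following is a minimal Python sketch of how an instance could be checked against a fragment of this graph type. It is an illustration under stated assumptions, not the prototype's implementation: the names SUPERTYPE, EDGE_TYPES, is_subtype and conforms are hypothetical, only four edge declarations are encoded, properties and cardinality annotations such as <1> are ignored, and an edge declared on a supertype is assumed to admit its subtypes as endpoints.

```python
# Hypothetical in-memory encoding of a fragment of the snb graph type.
# All names here are illustrative and do not come from the paper.

# Direct supertype of each node type (the "<:" declarations).
SUPERTYPE = {
    "Comment": "Message",
    "Post": "Message",
    "City": "Place",
    "Country": "Place",
    "Continent": "Place",
}

# A few (source type, edge type, target type) declarations from the schema.
EDGE_TYPES = {
    ("Comment", "REPLY_OF", "Message"),
    ("Message", "HAS_CREATOR", "Person"),
    ("Person", "IS_LOCATED_IN", "City"),
    ("City", "IS_PART_OF", "Country"),
}


def is_subtype(node_type, expected):
    """Return True if node_type equals expected or inherits from it."""
    while node_type is not None:
        if node_type == expected:
            return True
        node_type = SUPERTYPE.get(node_type)
    return False


def conforms(nodes, edges):
    """Check that every instance edge maps onto some declared edge type.

    nodes: dict mapping node id -> node type
    edges: iterable of (source id, edge type, target id)
    """
    for src, label, tgt in edges:
        matched = any(
            label == s_label
            and is_subtype(nodes[src], s_src)
            and is_subtype(nodes[tgt], s_tgt)
            for s_src, s_label, s_tgt in EDGE_TYPES
        )
        if not matched:
            return False
    return True


nodes = {"p1": "Person", "c1": "Comment", "m1": "Post", "x": "City"}
good = [
    ("c1", "REPLY_OF", "m1"),
    ("m1", "HAS_CREATOR", "p1"),
    ("p1", "IS_LOCATED_IN", "x"),
]
bad = [("p1", "REPLY_OF", "m1")]  # a Person cannot be the source of REPLY_OF

print(conforms(nodes, good))
print(conforms(nodes, bad))
```

The first call accepts the instance because each edge has a matching declaration once subtyping is taken into account; the second rejects it because no REPLY_OF declaration has Person (or a supertype of Person) as its source.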

Appendix B Merging pseudoquery for simple graphs

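Data: n1, n2 (nodes to merge)
// here we pick n1 to serve as the result of merge
set properties of n1 to props(n1) ∪ props(n2)
for s such that (n2, s) ∈ E and (n1, s) ∉ E do
       create edge (n1, s)
       set properties of (n1, s) to props(e), where e = (n2, s)
end for
for s such that (n2, s) ∈ E and (n1, s) ∈ E do
       set properties of (n1, s) to props(e1) ∪ props(e2), where e1 = (n1, s) and e2 = (n2, s)
end for
for p such that (p, n2) ∈ E and (p, n1) ∉ E do
       create edge (p, n1)
       set properties of (p, n1) to props(e), where e = (p, n2)
end for
for p such that (p, n2) ∈ E and (p, n1) ∈ E do
       set properties of (p, n1) to props(e1) ∪ props(e2), where e1 = (p, n1) and e2 = (p, n2)
end for
if (n2, n2) ∈ E and (n1, n1) ∉ E then
       create edge (n1, n1)
       set properties of (n1, n1) to props((n2, n2))
end if
detach and delete n2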
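
As an executable illustration of the merge described by this pseudoquery, the following Python sketch operates on an in-memory simple directed graph. It rests on assumptions not confirmed by the text: properties are modelled as dictionaries from keys to sets of values, merging takes the union of those sets, and edges between the two merged nodes become self-loops on the surviving node. The names merge_props and merge_nodes are hypothetical.

```python
def merge_props(p, q):
    """Union two property maps whose values are sets (assumed semantics)."""
    merged = {key: set(values) for key, values in p.items()}
    for key, values in q.items():
        merged.setdefault(key, set()).update(values)
    return merged


def merge_nodes(node_props, edge_props, n1, n2):
    """Merge n2 into n1 in place; n1 serves as the result of the merge.

    node_props: dict mapping node -> property map
    edge_props: dict mapping (source, target) -> property map
    """
    if n1 == n2:
        raise ValueError("n1 and n2 must be distinct nodes")
    node_props[n1] = merge_props(node_props[n1], node_props[n2])
    # Iterate over a snapshot because edges are added and removed in the loop.
    for (src, tgt), props in list(edge_props.items()):
        if n2 not in (src, tgt):
            continue
        # Redirect every endpoint equal to n2 to n1.
        new_edge = (n1 if src == n2 else src, n1 if tgt == n2 else tgt)
        # If the redirected edge already exists, its properties are united.
        edge_props[new_edge] = merge_props(edge_props.get(new_edge, {}), props)
        del edge_props[(src, tgt)]
    # All edges incident to n2 are gone, so the node can be deleted.
    del node_props[n2]


node_props = {"a": {"name": {"A"}}, "b": {"name": {"B"}}, "c": {}}
edge_props = {
    ("a", "c"): {"w": {1}},
    ("b", "c"): {"w": {2}},
    ("c", "b"): {},
    ("b", "b"): {"loop": {True}},
}
merge_nodes(node_props, edge_props, "a", "b")
print(node_props)
print(edge_props)
```

Because the graph is simple, redirecting an edge of n2 onto an existing edge of n1 cannot create a parallel edge; the two property maps are combined instead, which corresponds to the loops of the pseudoquery that only set properties.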

Appendix C Clone Cypher query

// Query performing clone of a node
MATCH (a { id : 'a' })
// create a node corresponding to the clone
CREATE (a1)
WITH a, a1
SET a1 = a
WITH a, a1
// match successors and out-edges
OPTIONAL MATCH (a)-[out_edge:edge]->(suc)
WITH a, a1, filter(
  el IN collect(
    {neighbor: suc, edge: out_edge})
  WHERE NOT el.neighbor IS NULL) as suc_maps
// match predecessors and in-edges
OPTIONAL MATCH (pred)-[in_edge:edge]->(a)
WITH a, a1, suc_maps, filter(
  el IN collect(
    {neighbor: pred, edge: in_edge})
  WHERE NOT el.neighbor IS NULL) as pred_maps
// copy all incident edges of the original node
FOREACH (suc_map IN suc_maps |
  FOREACH (suc IN [suc_map.neighbor] |
    CREATE (a1)-[new_edge:edge]->(suc)
    SET new_edge = suc_map.edge))
FOREACH (pred_map IN pred_maps |
  FOREACH (pred IN [pred_map.neighbor] |
    CREATE (pred)-[new_edge:edge]->(a1)
    SET new_edge = pred_map.edge))
// copy self loop
FOREACH (suc_map IN suc_maps |
  FOREACH (self_loop IN
    CASE WHEN suc_map.neighbor=a
    THEN [suc_map.edge] ELSE [] END |
      CREATE (a1)-[new_edge:edge]->(a1)
      SET new_edge = self_loop))
WITH a, a1
RETURN a1
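
The effect of the clone query can also be mirrored on an in-memory graph, which may help when checking the expected result by hand. The sketch below is an assumption-laden analogue rather than a translation of the query: nodes and edges are plain dictionaries, the function name clone_node is hypothetical, and property maps are copied shallowly.

```python
def clone_node(node_props, edge_props, a, a1):
    """Clone node a as a new node a1, copying properties and incident edges.

    node_props: dict mapping node -> property map
    edge_props: dict mapping (source, target) -> property map
    """
    if a1 in node_props:
        raise ValueError("the clone identifier is already in use")
    # Copy the properties of the original node onto the clone.
    node_props[a1] = dict(node_props[a])
    # Iterate over a snapshot because new edges are added in the loop.
    for (src, tgt), props in list(edge_props.items()):
        if src == a:
            # Copy of an out-edge: the clone points to the same successor.
            edge_props[(a1, tgt)] = dict(props)
        if tgt == a:
            # Copy of an in-edge: the same predecessor points to the clone.
            edge_props[(src, a1)] = dict(props)
        if src == a and tgt == a:
            # A self-loop on the original also yields a self-loop on the clone.
            edge_props[(a1, a1)] = dict(props)
    return a1


node_props = {"a": {"name": "x"}, "s": {}, "p": {}}
edge_props = {
    ("a", "s"): {"w": 1},
    ("p", "a"): {"w": 2},
    ("a", "a"): {"w": 3},
}
clone_node(node_props, edge_props, "a", "a1")
print(sorted(edge_props))
```

As in the query, a self-loop on the original node produces three new edges: one from the clone to the original, one from the original to the clone, and one self-loop on the clone.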

Appendix D Merge Cypher query