The notion of a property graph originated in the early 2000s in the Neo4j (https://neo4j.com) graph database system, and was popularized by what is now Apache TinkerPop (http://tinkerpop.apache.org), a suite of vendor-agnostic graph database tools including the Gremlin graph programming language. For most of their history, property graphs have been the stock-in-trade of software developers creating applications loosely based on various mathematical notions of labeled graph, but with little formal semantics or type checking associated with the labels. In that respect, property graphs differ from more heavyweight standards designed for knowledge representation, including the Resource Description Framework (RDF) and the Web Ontology Language (OWL).
In recent years, the developer community has increasingly turned to property graphs for large-scale data integration efforts including enterprise knowledge graphs
, i.e. graph abstractions that integrate a broad swath of a company’s data, often drawn from a variety of internal data sources and formats. These abstractions have expanded the de-facto meaning of graphs and have stretched the simple, intuitive property graph concept to its limits, leading to recent community efforts around standardization, such as the W3C Workshop on Web Standardization for Graph Data (https://www.w3.org/Data/events/data-ws-2019/) and the associated Property Graph Schema Working Group (https://3.basecamp.com/4100172/projects/10013370). The authors of this paper are also involved in the Working Group, and our formalism was designed with an eye toward integration with the emerging standard. At the same time, we take more of a minimalist approach, building upon a core concept which has been essential for bridging the gap between typical property graphs, RDF datasets, and production datasets at Uber: algebraic data types [mitchell].
By specifying a mathematically rigorous data model, we aim to provide the common ground that is sought after by both the developer and academic graph communities: a framework which is simple and developer-friendly, yet also formal enough for modern principles of computer science to apply. To that end, we have chosen to describe algebraic property graphs using the language of category theory, which not only emphasizes compositionality and abstract structure, but also comes equipped with a rich body of results about algebraic data types [mitchell]. Although all of the categorical concepts used in this paper are defined in Appendix A, readers may find textbooks such as [Awodey:2010:CT:2060081] useful. [milewski2018category] is a particularly approachable introduction to category theory for software developers. Our use of category theory has also allowed us to implement this entire paper as a built-in example program in the open-source Categorical Query Language (CQL) [wadt] (http://categoricaldata.net), which has significant connections to the work presented here: algebraic property graphs are algebraic databases in the sense of [wadt]. However, as has historically been the case with applications of category theory to data management [grust], category theory is the medium, not the message, of this paper.
This paper is organized as follows. In Section 2, we describe property graphs and their use in Uber and Apache TinkerPop, along with other relevant graph and non-graph data models. Our main contribution is in Section 3, where we define algebraic property graphs along with various derived constructions, and in Section 4
, where we introduce a taxonomy for classifying graph elements according to their schema. We conclude in Section 5. In Appendix A, we review standard material on category theory. Extensions of algebraic data types are described in Appendix B, and algorithms on algebraic property graphs are discussed in Appendix C. Finally, in Appendix D we discuss mappings between algebraic property graphs and selected data models in connection with the upcoming version 4 of Apache TinkerPop.
2 Graph and Non-Graph Data Models
2.1 Property Graphs
Property graphs [rodriguez2010constructions] are a family of graph data models which are typically concerned only with graph structure; graph semantics are left to the application. Every graph in these data models is made up of a set of vertices connected by a set of directed, labeled edges. Vertices and edges are collectively known as elements. Every element has a unique identifier, and may be annotated with any number of key-value pairs known as properties.
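As a concrete illustration of these commonalities, the following Python sketch (a minimal model of our own; the class and field names are illustrative and not part of any property graph API) represents vertices and edges as identified, labeled elements annotated with key-value properties:

```python
from dataclasses import dataclass, field
from typing import Any, Dict

# Every element has a unique identifier, a label, and any number of
# key-value pairs known as properties.
@dataclass
class Element:
    id: int
    label: str
    properties: Dict[str, Any] = field(default_factory=dict)

@dataclass
class Vertex(Element):
    pass

@dataclass
class Edge(Element):
    out_id: int = 0  # identifier of the source (out) vertex
    in_id: int = 0   # identifier of the target (in) vertex

# A tiny graph: a User vertex, a Trip vertex, and a rider edge between them.
u1 = Vertex(1, "User", {"name": "Alice"})
t1 = Vertex(2, "Trip")
rider = Edge(3, "rider", out_id=2, in_id=1)
```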
Beyond these basic commonalities, property graph data models start to differ. Among implementation-neutral property graph frameworks, the first and most widely used is Apache TinkerPop, which allows graph data models to vary according to a number of dimensions or “features”, which conceptually may be specified by answering certain questions. For example:
Which primitive types are supported in the graph? Are complex types such as lists, maps, and sets supported?
Which types may be used as the identifiers of vertices and/or edges? For example, certain implementations identify elements with integers, others with UUIDs, and others with strings. Still others allow developers to provide IDs of a variety of primitive types.
What kinds of properties are allowed? Usually, both vertex and edge properties are supported, but not always. Furthermore, certain implementations allow so-called meta-properties, described in more detail below.
Additional, vendor-specific schema frameworks provide further degrees of freedom that deal with such things as unlabeled, singly-labeled, and multiply-labeled vertices, inheritance relationships among labels, type constraints and cardinality constraints on properties and edges, higher-order edges, and more.
At Uber, graph-like schemas are written in a variety of formats: primarily in Thrift, Avro, and Protocol Buffers as described below, and also in an internal YAML-based format which is used for standardized vocabularies. Custom tooling is used to transform schemas between these source-of-truth formats, generate documentation, establish interoperability with RDF-based tools, and support other internal frameworks. The tooling allows interrelated sets of schemas to propagate across architectural boundaries. Increasing the compatibility of schemas cuts down on duplication of effort and facilitates the composition of data sources not previously connected. Although these schemas are particular to Uber and have been designed with its data integration efforts in mind, it has been our impression that similar notions of schema are used elsewhere for similar purposes.
2.2 Resource Description Framework (RDF)
The Resource Description Framework [world2014rdf] is a W3C recommendation and the most widely used approach to knowledge representation on the Web. RDF statements are subject-predicate-object triples, any set of which forms an RDF graph. These graphs can be serialized in many formats, from XML-based formats to JSON-based ones. An example RDF graph is shown in Figure 1 using the Turtle RDF format. Individual RDF statements can be seen as arrows in the diagram, while the colors stand in for additional rdf:type statements such as:
:u1 rdf:type ex:User .
:t1 rdf:type ex:Trip .
:p2 rdf:type ex:Place .
:e1 rdf:type ex:PlaceEvent .
The prefixes rdf: and ex: are abbreviations for namespaces, while the empty prefix : indicates a default namespace. A proper introduction to RDF is beyond the scope of this paper, but a suggested starting point is [manola2014rdf].
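The triple structure described above can be modeled directly. In this sketch, the example statements are represented as plain (subject, predicate, object) tuples, and typed resources are recovered by filtering; this simplification deliberately ignores namespaces, literals, and blank nodes:

```python
# The example rdf:type statements as (subject, predicate, object) tuples.
triples = {
    (":u1", "rdf:type", "ex:User"),
    (":t1", "rdf:type", "ex:Trip"),
    (":p2", "rdf:type", "ex:Place"),
    (":e1", "rdf:type", "ex:PlaceEvent"),
}

def resources_of_type(graph, t):
    """All subjects s such that (s, rdf:type, t) is in the graph."""
    return {s for (s, p, o) in graph if p == "rdf:type" and o == t}
```

Because RDF permits any number of type statements per resource, the function returns a set, which may be empty or contain several subjects.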
Note that although we have given each resource exactly one type in Figure 1, RDF permits any number of type statements per resource, including none at all. Similarly, although each Trip in this example has exactly one driver, exactly one rider, and so on, RDF itself allows any number of such statements in the same graph. In order to add cardinality constraints, one needs an additional formalism on top of RDF, such as the Web Ontology Language (OWL) [owl2012owl] or the Shapes Constraint Language (SHACL) [knublauch2017shapes]
. A basic schema language, RDF Schema, and a pattern-matching query language, SPARQL, are provided with RDF, each with a formal set-theoretic semantics. These formal semantics may help to explain why RDF continues to be heavily used for enterprise knowledge graphs despite the popularity of the more lightweight property graph data models; they give RDF a variety of practical advantages, such as easy portability of data and the ability to meaningfully merge multiple RDF graphs into a single graph. The shortcomings of RDF mainly arise from the complex and sometimes ad-hoc nature of the specifications themselves. Although RDF is an extremely versatile data model for the skilled user, creating specification-compliant tooling is challenging, and formal analyses involving RDF often become bogged down in discussions of less-essential features such as blank nodes or reification. There is a widely recognized need for various simplifying improvements to RDF, but a new, unified effort to update the data model would be a significant undertaking that has not yet been attempted.
In contrast with RDF, property graphs lack not only a widely-accepted standard, but also a formal semantics and an agreed-upon notion of schema; while a number of major property graph vendors provide rich schema languages, these languages are idiosyncratic and pairwise incompatible. In spite of these limitations, property graphs have flourished in the developer community due to their simplicity vis-a-vis RDF. As the strengths and weaknesses of RDF and property graphs are somewhat complementary, there is a long history of building bridges between the two data models, beginning with a tool called neo4j-rdf-sail (https://github.com/neo4j-contrib/neo4j-rdf-sail) in 2008 and continuing through the earliest TinkerPop APIs. Formally described mappings such as [hartig2014reconciliation] and [das2014tale] have also begun to gain traction, fostering community-led standardization efforts.
2.3 Data Serialization Languages
Some of the most common serialization languages used for streaming data and remote procedure calls in the enterprise are Apache Thrift, Apache Avro, and Protocol Buffers. There are many others, but these three have had the greatest influence on this paper due to their use at Uber. For data modeling purposes, all three languages are similar, in that they encode a system of algebraic data types based on primitive types such as strings and integers, product types with fields, and sum types with cases. All three include a kind of enumeration, and only version 3 of Protocol Buffers lacks support for optional values. This commonality makes it straightforward to translate types from one framework to another, despite numerous minor incompatibilities. Interfacing with property graph schemas, however, has been more challenging, and has provided some of the motivation for this paper.
A detailed comparison of the languages is shown in Figure 2. Although not yet supported by the tooling at Uber, the GraphQL Schema Definition Language is included in this table because it has recently been suggested as a base language for property graph schemas. Notably, GraphQL SDL is more similar to the other formats in the table than it is to RDF or to currently mainstream property graph schema languages, such as those of Neo4j or JanusGraph.
A product, called a “struct” in Thrift, a “record” in Avro, and “message” in Protocol Buffers, is an ordered tuple with named fields. It is an instance of a product type. A sum, called a “union” in Thrift and Avro, and a “oneof” in Protocol Buffers, represents a choice between a list of alternatives, sometimes called cases. Enumerations and optionals are also sum types. See Figure 3 for an example of a product type (a struct) and a sum type (an enum) in Thrift IDL syntax.
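The shared core of these languages — products with named fields and sums with cases — can be sketched in Python as follows; the type and field names here are illustrative and are not taken from Figure 3:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Union

# Product type: an ordered tuple with named fields -- a Thrift "struct",
# an Avro "record", or a Protocol Buffers "message".
@dataclass
class Trip:
    trip_id: int
    driver_id: int

# Enumeration: a degenerate sum type whose cases carry no data.
class Status(Enum):
    AVAILABLE = 1
    IN_TRIP = 2

# General sum type: a choice between cases -- a Thrift or Avro "union",
# a Protocol Buffers "oneof". Here an owner is a user OR an organization.
@dataclass
class UserOwner:
    user_id: int

@dataclass
class OrgOwner:
    org_id: int

Owner = Union[UserOwner, OrgOwner]
```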
2.4 Hypergraph Data Models
Although there are many notions of hypergraph in the literature, the term usually refers to:
a data structure which embodies the usual mathematical notion of a hypergraph, i.e. a graph in which a given edge may join any number of vertices; or,
a data structure in which edges are also vertices, and may be connected by further edges.
Hypergraph databases commonly combine these two features along with a notion of edge and/or vertex label, as well as labels for fields or roles, i.e. the named components of a hyperedge. For example, the Hypernode model [levene1990hypernode] conceptualizes each graph node as a graph in its own right, having a label and containing a set of nodes and a set of labeled edges between the nodes. A visual formalism is provided along with the basic data model and a Datalog-based query language. The Groovy data model [levene1991object] takes this visual formalism further and adds a stronger notion of object orientation. HypergraphDB [iordanov2010hypergraphdb] was influenced by Hypernode and Groovy, but makes the notion of edge-as-node explicit. HypergraphDB was also the first hypergraph database to become widely known in the developer community, providing transaction safety and other features commonly expected of a graph database. Most recently, the Grakn hypergraph database [messina2017biograkn] has addressed the problem of aligning graph and relational databases. Among such data models, Grakn is the most similar to what we describe in Section 4.5 on hyperelements.
2.5 Relational Databases
There is a great deal of interplay between graphs and relational database theory, and a correspondingly large amount of past research. Here, we will only make some basic observations. For example, a graph with directed edges and at most one edge between any given pair of vertices is equivalent to a binary relation: the edge relation of the graph. Hence, we can encode such graphs and operations on them in, e.g., SQL, and generalizations of this encoding appear in many software systems. These encodings can also be used to prove inexpressivity results, such as the result that no relational algebra query can compute the transitive closure of a graph’s edge relation [Doan:2012:PDI:2401764]. Despite these inexpressivity results, in practice, much graph processing is done on relational systems, and vice versa. See Section D.2 for a discussion of relational databases as algebraic property graphs and vice versa.
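The observations above can be made concrete. Treating the edge relation as a set of pairs, a single relational join computes two-step paths, while iterating the join to a fixed point computes the transitive closure — and it is exactly this unbounded iteration that no single relational algebra query can express:

```python
def compose(r, s):
    """Relational join of two binary relations: pairs (a, c) such that
    (a, b) is in r and (b, c) is in s for some b."""
    return {(a, c) for (a, b) in r for (b2, c) in s if b == b2}

def transitive_closure(r):
    """Iterate the join to a fixed point; each pass adds longer paths."""
    closure = set(r)
    while True:
        bigger = closure | compose(closure, closure)
        if bigger == closure:
            return closure
        closure = bigger

# A directed path 1 -> 2 -> 3 -> 4 encoded as a binary edge relation.
edges = {(1, 2), (2, 3), (3, 4)}
```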
3 Algebraic Property Graphs
In this section, we define algebraic property graphs (APGs) and their morphisms, and briefly study the three traditional classes of model management [wadt] operations on APGs: conjunctive queries, data integration according to data linkages, and data migration along schema mappings. To fully understand this section, a familiarity with category theory is required; Appendix A provides an introduction or review, and familiarity with database theory [Doan:2012:PDI:2401764] is helpful. However, the reader is free to skip the more formal material and follow along using the provided examples and diagrams; this is sufficient to get a basic sense of APGs. If this paper is rendered in color, the reader will see distinct colors for the labels, types, elements, and values of graphs, concepts which will be described below. The colors are intended to enhance readability, and are not essential for understanding the text.
The definition of APG is parameterized by a set P, the members of which we call primitive types (we make no assumptions about the intended semantics of base types, which will typically be atomic objects such as character strings or floating-point numbers, but complex objects, such as lists, trees, functions in λ-calculus notation, black-box functions, and even objects in object-oriented programming notation can easily be added to our proposal, as described in Appendix B), and for each t ∈ P, a set V_t, the members of which we call the primitive values of t. An APG G, then, consists of:
a set L, the members of which we call the labels of G, such as Person or name.
a set T, the members of which we call the types of G, such as Integer or Person × Person. Types are defined as terms (expressions) in the grammar:

t ::= Prim p | Lbl l | 0 | 1 | t + t | t × t        (p ∈ P, l ∈ L)
where we may omit writing Prim and Lbl when they are clear from context.
a set E, the members of which we call the elements of G, such as v₁ or e₁.
a set V, the members of which we call the values of G, such as 42 or (v₁, v₂). Values are defined as the terms in the grammar:

v ::= c | e | () | (v, v) | inl v | inr v        (c a primitive value, e ∈ E)
a function λ : E → L which provides the label of each element; e.g. λ(e₁) = knows.
a function υ : E → V which provides the value of each element; e.g. υ(e₁) = (v₁, v₂).
a function σ : L → T which provides the schema of each label; e.g. σ(knows) = Person × Person. We also speak of σ as the schema of G.
a function τ : V → T which provides the type of each value; e.g. τ((v₁, v₂)) = Person × Person. This function is defined by structural recursion on values: a primitive value in V_t has type Prim t, an element e has type Lbl λ(e), () has type 1, pairs have product types, and injections have sum types.
Finally, every APG must obey the equation

τ(υ(e)) = σ(λ(e))        for every element e ∈ E,

which states that the type of the value of each element is the same as the schema of the label of the element, ensuring that the structure of a graph always matches its schema. This equation can be visualized as a commutative square:

E ──υ──→ V
│        │
λ        τ
↓        ↓
L ──σ──→ T
The types and values of an APG G are those of a canonical type theory for product and sum data generated by the elements and labels of G, along with the given primitive types P and primitive values V_t. This type theory forms a (bi-cartesian) category in the usual way, as described in Appendix B.
Intuitively, an algebraic property graph is a collection of elements, each of which has an associated value. Values can be primitive values such as "Hello, world", element references such as v₁, or complex objects which are typed by products and/or coproducts. For example, the value of a knows edge is a pair such as (v₁, v₂) of two vertices representing people: the Person who knows, and the Person who is known. The value of a name property is a pair such as (v₁, s), which contains a Person vertex and a String s. A vertex has no value, or rather, it has the trivial value (). As in familiar property graph APIs like TinkerPop, every element also has an associated label, in this case knows, Person, or name. The type literal String is not a label, but a basic type in the type system of the graph. To every label, there is a schema, which is a type such that the value of an element with a particular label is expected to conform to that type. For example, the schema of knows is Person × Person, so the value of every element of that label needs to be a valid pair; the “out vertex” must be a Person that exists in the graph, and so must the “in vertex”.
Continuing with the example, a label like knows has a schema like Person × Person, i.e. σ(knows) = Person × Person, which is a product type. The schema of the Person label is just the unit type 1, i.e. σ(Person) = 1. One can think of a vertex as “containing no information” other than its identity, whereas an edge also contains a reference to an out-vertex and a reference to an in-vertex. Now, suppose we have Person vertices v₁ and v₂ (i.e. λ(v₁) = λ(v₂) = Person), and knows edge e₁ (i.e. λ(e₁) = knows), such that υ(e₁) = (v₁, v₂). That is to say, the value of e₁ is a pair of vertices. As v₁ and v₂ are vertices, their values are trivial, i.e. υ(v₁) = υ(v₂) = (). Finally, we require by the identity above that

τ(υ(e₁)) = τ((v₁, v₂)) = Person × Person = σ(knows) = σ(λ(e₁)).

This expresses the fact that the type of the value of e₁ is exactly the schema of the label knows of e₁. An analogous identity holds, trivially, for the vertices v₁ and v₂:

τ(υ(v₁)) = τ(()) = 1 = σ(Person) = σ(λ(v₁)).
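The definition above can also be sketched operationally. The following Python model (our own encoding, unrelated to the CQL implementation) represents types and values as nested tuples, computes τ by structural recursion, and checks the commutative square τ ∘ υ = σ ∘ λ on the knows example:

```python
# Types: ("one",), ("lbl", name), ("prim", name), ("times", t1, t2).
# Values: ("unit",), ("elmt", id), ("prim", v, name), ("pair", v1, v2).
UNIT_T = ("one",)
UNIT_V = ("unit",)

def type_of(value, label_of):
    """tau: the type of a value, by structural recursion."""
    kind = value[0]
    if kind == "unit":
        return UNIT_T
    if kind == "prim":
        return ("prim", value[2])
    if kind == "elmt":                      # an element reference is typed
        return ("lbl", label_of[value[1]])  # by the label of its target
    if kind == "pair":
        return ("times", type_of(value[1], label_of),
                         type_of(value[2], label_of))
    raise ValueError(kind)

# The running example: two Person vertices and one knows edge.
schema = {"Person": UNIT_T,
          "knows": ("times", ("lbl", "Person"), ("lbl", "Person"))}
label_of = {"v1": "Person", "v2": "Person", "e1": "knows"}
value_of = {"v1": UNIT_V, "v2": UNIT_V,
            "e1": ("pair", ("elmt", "v1"), ("elmt", "v2"))}

def well_formed(schema, label_of, value_of):
    """Check tau(upsilon(e)) == sigma(lambda(e)) for every element e."""
    return all(type_of(value_of[e], label_of) == schema[label_of[e]]
               for e in label_of)
```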
For additional examples, see the taxonomy in Section 4.
3.2 Operations on APGs
In this section, we study three collections of operations [Doan:2012:PDI:2401764] on APGs:
set-theoretic/relational operations, such as joining APGs by taking their product and then filtering them according to a condition; and
data integration operations, such as merging APGs according to their primary keys; and
data migration operations, such as changing the schema of an APG according to a “schema mapping”.
For each collection of operations above, we define a category and prove the existence of certain universal constructions in that category. Although each category is (necessarily) different, they all make use of a common notion of APG morphism, which they may further restrict and which we define now. A morphism h : G → H of APGs consists of:
a function h_E from the elements of G to the elements of H, and a function h_L from the labels of G to the labels of H, such that λ_H ∘ h_E = h_L ∘ λ_G.
We say that h is natural on λ.
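This naturality condition can be sketched minimally, using dictionaries for λ and for the components h_E and h_L; the morphism below, which collapses two Person vertices, is a hypothetical example of our own:

```python
# An APG morphism h consists of h_E on elements and h_L on labels;
# naturality on lambda says labeling commutes with the morphism.
def natural_on_lambda(h_E, h_L, lambda_G, lambda_H):
    return all(lambda_H[h_E[e]] == h_L[lambda_G[e]] for e in lambda_G)

# Collapse two Person vertices of G onto one Person vertex of H.
lambda_G = {"v1": "Person", "v2": "Person"}
lambda_H = {"w": "Person"}
h_E = {"v1": "w", "v2": "w"}
h_L = {"Person": "Person"}
```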
3.2.1 Querying APGs
APGs support generalizations of most operations from set theory and relational algebra (RA). To make this statement precise, we arrange APGs into a category APG and prove the existence of:
initial objects in APG, which correspond to the empty set in Set and the empty table (of some given arity) in RA; and
terminal objects in APG, which correspond to singleton (one element) sets in Set and singleton tables (of some given arity) in RA; and
product objects in APG, which correspond to the cartesian product in Set and CROSS JOIN in RA; and
co-product objects in APG, which correspond to disjoint union in Set and (roughly) OUTER UNION ALL in RA; and
equalizer objects in APG, which correspond to a bounded comprehension principle in Set and SELECT-FROM-WHERE filtering in RA.
The operations above can be used as a query language for APGs, subject to the usual limitations of relational operations for graph processing discussed in Section 2.5.
APGs and their morphisms form a bi-cartesian category with equalizers, APG.
Algorithms for computing the above constructions, and those in the next sections, are described in Appendix C, and are implemented, along with all of the examples in this paper, with the CQL tool (http://categoricaldata.net/download). Coq [Bertot:2010:ITP:1965123] proofs of all the theorems in this paper are also available (http://categoricaldata.net/APG.v).
3.2.2 Integrating APGs
Although the category APG described in the previous section supports joining APGs, it does not always support the dual operation, APG merge / pushout (co-product followed by co-equalizer), because the category APG does not always admit co-equalizers; for example, certain pairs of parallel APG morphisms admit no co-equalizer. (We can add quotient types to recover co-equalizers, but there are good and widely-recognized reasons to avoid doing so in data integration contexts [Ghilardi:2006:DID:3029947.3029974].) Hence, we move to a different category to obtain pushouts:
For each schema σ, the APGs with schema σ and the σ-preserving APG morphisms form a category, σ-APG-Int. It has co-equalizers of a pair of parallel morphisms whenever the equivalence relation the pair induces on elements is compatible with their values.
In practice, we expect that most morphisms to be co-equalized will be generated by entity resolution or record linkage [Doan:2012:PDI:2401764] algorithms; when elements are matched based on equality, we obtain the analog of set-theoretic union for APGs. Note, however, that matching APG elements based on equality can be too fine a notion of equivalence in situations where element ids are meaningless identifiers, in the sense that, for sets, A ≅ B (where ≅ indicates isomorphism) does not imply either A ⊆ B or B ⊆ A. Methods for obtaining suitable (coarse-enough) matches are described in [schultz_wisnesky_2017] and [wadt], and their further study in the context of APG is left for future work.
We conclude this section with an example of a simple coarse-enough matching strategy, which is to match two APGs with the same schema, where we require that every element have a primary key that does not contain labels. We match elements based on equality of those keys. For example, let G₁ and G₂ be APGs with one label, PlateNumber, whose schema is the product type String × String × String, where the first component is a country id, the second a region id, and the third a regional license plate number. By examining the primary keys, we obtain a matched APG on the same schema, along with inclusion morphisms into G₁ and G₂; hence we have constructed a span of APGs. If we go on to pushout this span, we obtain the merged APG, in which elements with equal keys have been identified.
In this way, we can easily compose two such graphs into a larger graph, an operation which has proven to be extremely useful in the context of RDF triple stores, but which tends to be ill-defined and difficult in the context of property graphs. The approach works just as well with typical, simple vertex and edge ids as it does with the compound keys in the above example. Note also that although this example finds a span and computes a pushout, it can easily be re-phrased to find a pair of parallel morphisms and compute a co-equalizer [Awodey:2010:CT:2060081].
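A sketch of this matching strategy follows, with invented plate-number triples standing in for the example’s values; identifying elements with equal keys makes the pushout amount to a key-indexed union:

```python
# Two APGs over the PlateNumber schema: each element's value is a
# (country, region, plate) triple, which also serves as its primary key.
# The concrete triples below are invented for illustration.
g1 = {"a1": ("US", "CA", "123"), "a2": ("US", "NY", "456")}
g2 = {"b1": ("US", "CA", "123"), "b2": ("MX", "BC", "789")}

def merge_by_key(*graphs):
    """Union of element values; elements with equal keys are identified."""
    merged = set()
    for g in graphs:
        merged |= set(g.values())
    return merged
```

Elements a1 and b1 carry the same key and are identified, so the merged graph has three elements rather than four.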
3.2.3 Migrating APGs
As was the case with integrating APGs, to migrate APGs and their morphisms from one schema to another we must necessarily work with a category other than APG. We begin by describing an example of the kind of APG morphism that is too general to migrate. Because an APG is a functor from the category CD with four objects, E, L, V, and T, four generating morphisms, λ : E → L, υ : E → V, σ : L → T, and τ : V → T, and one generating equation, τ ∘ υ = σ ∘ λ, to Set, the category of sets and functions, we may speak of a natural transformation from APG G to APG H, which consists of a function α_E, a function α_L, a function α_V, and a function α_T, one per object, such that the four naturality squares, one per generating morphism, commute.
We might expect that an APG morphism would induce a natural transformation of the corresponding functors for G and H, but this is not so, since APG morphisms need not be natural on υ. Hence we have:
APGs and their natural transformations form a category, APG-NT. The set of morphisms in this category is a proper subset of the set of morphisms of APG.
APG morphisms need not be natural transformations because they need not be natural on υ, σ, or τ, only λ. In this section, we will require naturality of all four, making our APG morphisms into natural transformations and allowing the theories of sketches [sketches] and functorial data migration [wadt] to apply. Naturality of τ follows from the others, and all APG morphisms are λ-natural, so we will only actually use two naturality conditions.
Although thinking of APGs as functors is useful for stating the condition we require APG morphisms to satisfy in order to migrate them, for the purposes of actually performing migrations it is usually more useful to replace CD with a larger (in fact, infinite) category constructed as follows. Let σ be an APG schema over labels L. The category C_σ is defined as the free bi-cartesian category [seely] (see also Appendix B) generated by the labels in L and the base types in P, together with a morphism l → σ(l) for every l ∈ L.
For each schema σ, every APG G with schema σ induces a bi-cartesian functor from C_σ to Set, determined by sending each label to its set of elements. Similarly, every σ-preserving and υ-natural APG morphism induces a natural transformation between the induced functors. These APGs and morphisms form a category, σ-APG, with co-products and an initial object.
This theorem allows us to change the schema of an APG, and to migrate APGs onto data models besides APG. Because an APG G on schema σ induces a functor from C_σ to Set, if we are given a category D and a functor M : D → C_σ, then we may define a Set-valued functor Δ_M(G) on D via pre-composition with M: Δ_M(G) := G ∘ M.
In practice, we will usually want to restrict to those M that preserve base types and values [wadt]. In the case where D = C_σ′ for some APG schema σ′, Δ_M corresponds to a projection onto the other APG schema σ′. When D is not of this form, we may still interpret Δ_M(G) as an algebraic database in the sense of [wadt] and a sketch in the sense of [sketches], providing immediate connections to SQL, RDF, and other data models. Dually, we may consider bi-cartesian functors out of C_σ.
Because equality in C_σ is decidable [seely], when D is finitely presented we can check that M is indeed a functor and not merely an assignment of objects to objects and morphisms to morphisms. In particular, this means that we can check whether mappings from SQL schemas in categorical normal form [wadt] into APG schemas preserve foreign key constraints, providing a guarantee that an APG can be migrated to a SQL schema via Δ_M without referential integrity violations. More generally, we can check that arbitrary equational constraints will hold in our materialized SQL database.
Note that Δ_M is functorial, meaning that given a morphism of APGs G → H, we obtain a morphism of APGs Δ_M(G) → Δ_M(H). In other words, Δ_M is a functor from the category of bi-cartesian functors on C_σ to the category of functors on D [wadt]. It is called the model reduct functor in the theory of institutions. It may or may not have left or right adjoints (weak inverses [Awodey:2010:CT:2060081]), called Σ_M and Π_M, respectively, in the sense that although functorial data migration [wadt] allows us to compute functors Σ_M(G) and Π_M(G) from any functor G, a priori Σ_M(G) and Π_M(G) need not be bi-cartesian and hence need not correspond to APGs. An APG-centric study of Δ, Σ, and Π, which are sufficient to express most traditional data migration operations [wadt, Doan:2012:PDI:2401764], is left for future work.
We conclude this section by defining a notion of APG schema mapping for APG schemas σ and σ′, such that each schema mapping induces a bi-cartesian functor from C_σ to C_σ′. In practice, we expect this notion will be used to migrate data via the induced functors, rather than via arbitrary functors M directly. An APG schema mapping F consists of a function, F_L, taking each label l of σ to a type F_L(l) of σ′, and a function taking each label l of σ to a morphism of C_σ′ relating F_L(l) to the result of replacing each label l′ with F_L(l′) in the type σ(l); the replacement operation is simply structural recursion (“fold”) and is defined in Appendix C. To write these morphisms in C_σ′ we can use a point-free syntax similar to Appendix B, or we may extend our definition of term to include variables, pair projections, case analysis, and “de-referencing” along the morphisms of C_σ′, together with the usual axioms for products and co-products, under the usual notion of capture-avoiding substitution.
We assume without loss of generality that each bound variable in a case expression is “fresh”; i.e., does not appear outside that expression. A morphism t₁ → t₂ can then be written as a term of type t₂ that has one free variable of type t₁. Each schema mapping F induces a bi-cartesian functor from C_σ to C_σ′.
The data migrations expressible via schema mappings include dropping labels, duplicating labels, permuting the fields of product types, and, when we add an equality function to our type side, joining of labels. As a simple example, consider a schema σ with one label whose type is a product of two components, and a schema σ′ with one label whose type is the same product with the components swapped. A schema mapping between them sends the one label to the other, with morphism the term that exchanges the two pair projections; the induced functor then converts APGs between the two schemas by permuting projections.
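A sketch of the simplest such migration — permuting the two fields of a product type — applied directly to element values; the labels and pair values here are invented for illustration:

```python
# A migration along a field-permuting schema mapping: each element's
# pair value (x, y) at the source label becomes (y, x) at the target.
def migrate_swap(values):
    """Apply the induced migration to every element's pair value."""
    return {e: (y, x) for e, (x, y) in values.items()}

# Source APG values: each element carries a pair.
source = {"e1": (1, "a"), "e2": (2, "b")}
```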
4 Taxonomy of Elements
In this section, we develop a taxonomy for classifying property graph elements according to their associated labels and schemas. By analogy with the mathematical notion of a graph, property graphs are described in terms of vertices and edges (or nodes and relationships, among other similar terms), collectively termed elements, as well as properties, which connect elements to typed values. For example, a name property might connect a User-vertex to a String. Properties themselves are frequently subdivided according to the class of element they attach to; thus, we have vertex properties, edge properties, and even meta-properties. We now provide concrete definitions for these and other concepts in terms of the APG data model defined above.
A vertex label is a label l with σ(l) = 1, where 1 is the unit type; () represents the unit value. An l-vertex is an element v such that λ(v) = l. For example, consider a graph containing a User-vertex u₁ and a Trip-vertex t₁, i.e. λ(u₁) = User and λ(t₁) = Trip, with υ(u₁) = υ(t₁) = ().
We can display this simple graph visually.
A designated label, such as the empty string, may be used as the label of otherwise “unlabeled” vertices. Note, however, that there is no difference between the value of a labeled and an unlabeled vertex; in both cases, the value is ().
An ordinary, binary edge label is a label l such that σ(l) = l₁ × l₂ for vertex labels l₁ and l₂. An l-edge is an element e such that λ(e) = l. For example, consider a graph containing a driver-edge e₁ and a rider-edge e₂, which connect Trip-vertex t₁ to User-vertices u₁ and u₂, respectively; i.e. υ(e₁) = (t₁, u₁) and υ(e₂) = (t₁, u₂).
We have used the term “ordinary” in the definition above because there are many other kinds of elements which may be considered “edges” in certain contexts. For example, suppose we have a label owner where σ(owner) = Vehicle × (User + Organization), and Vehicle, User, and Organization are all vertex labels. This is a very useful construction which allows the schema developer to specify either a user or an organization as the owner of a vehicle.
Certain property graph data models even allow higher-order edges; accommodating such elements in APG is just a matter of modifying the above reference to “vertex labels” to labels of an appropriate kind. For example, we might define “edge-vertex edges” in terms of edge labels l with σ(l) = l₁ × l₂, where l₁ is an ordinary edge label and l₂ is a vertex label. There is an endless variety of such patterns. Similarly, we can generalize to n-ary edges (Section 4.5), or to indexed edges (Appendix B).
An ordinary vertex property label (although we use the term “property label” in this paper for consistency, the term “property key” is more conventional) is a label $p$ such that $\sigma(p) = v \times t$ for a vertex label $v$ and value type $t$. A $p$-vertex property is an element $x$ such that $\lambda(x) = p$. For example, let the value type be the primitive type String, and consider a graph containing a single User-vertex with two name-properties:
So-called multi-properties, i.e. properties with the same label outgoing from the same element, are supported by default in APGs. In the example above, the User vertex has two distinct name properties, which happen to have different values. In order to disallow multi-properties, we must add a uniqueness constraint, namely:

$\forall e_1, e_2 \in E, \quad \lambda(e_1) = \lambda(e_2) = p \;\wedge\; \pi_1(\upsilon(e_1)) = \pi_1(\upsilon(e_2)) \implies e_1 = e_2,$

where $p$ is a property label as defined here.
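The uniqueness constraint amounts to a simple check: no two elements with the property label may share the same source component. A sketch, using our own dictionary-based representation of the label and value functions:

```python
# A sketch of the multi-property uniqueness check: a property label p admits
# no multi-properties iff no two p-elements share the same source component.
def satisfies_uniqueness(elements, lam, value, p):
    """elements: element ids; lam: element -> label;
    value: element -> (source, val) pair for property elements."""
    sources = [value[e][0] for e in elements if lam[e] == p]
    return len(sources) == len(set(sources))

# Two name-properties attached to the same vertex v1 (a multi-property).
lam = {"p1": "name", "p2": "name"}
value = {"p1": ("v1", "Alice"), "p2": ("v1", "Ally")}
elements = {"p1", "p2"}
```

With both properties present the constraint fails; restricted to a single property it holds.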
Apart from vertex properties, most property graph implementations also support edge properties. An ordinary edge property label is a label $p$ such that $\sigma(p) = e \times t$ for an edge label $e$ and value type $t$. For example, we may attach a driverStatus property to the driver edge defined in Section 4.2, where $\sigma(\mathrm{driverStatus}) = \mathrm{driver} \times t$ for some value type $t$.
We have used the term “ordinary” above to distinguish typical vertex and edge properties from so-called meta-properties, i.e. properties of properties. Rather than say “vertex property property” we say “vertex meta-property”. By analogy, one can speak of “edge meta-properties”, although these are uncommon in practice. For example, we may choose to add a Double-valued confidence meta-property to the name property in the example above.
We can easily generalize the value type of “ordinary” properties from base types to arbitrary types that do not contain labels, such as products or sums of base types. For more expressiveness, we may allow labels, keeping properties distinct from edges by requiring only that we can recursively “de-reference” all labels to obtain a non-label-containing type. For example, if $\sigma(\mathrm{driverStatus}) = \mathrm{driver} \times \mathrm{Status}$ and $\sigma(\mathrm{Status})$ contains no labels, then driverStatus is still a generalized property label. The Status label is called an alias; see below.
The APG data model uses the term “element” more broadly than typical property graph frameworks; vertices and edges are elements, but so are properties, as well as all other product and sum types to which we have given a label in the schema. Whereas the value of an edge or property contains two other values, and the value of a vertex contains no values, the elements we call aliases contain a single value.
Specifically, a data type alias is a label $a$ such that $\sigma(a)$ is a primitive type, or more generally, any type not containing labels. More generally still, we may require only that $\sigma(a)$ “de-references” into a type that does not contain labels; see Section 4.3 above. For example, consider a graph and schema in which DegreesLatitude and DegreesLongitude are both aliases for a primitive Double type, the values of which are numbers such as 37.78 and -122.42:
Another useful kind of alias is a vertex tag. This term is not commonplace in property graph APIs, but tags frequently crop up in enterprise settings, where they serve as a kind of unary “edge” which is simpler than either a (binary) edge or a property. Specifically, a vertex tag label is a label $t$ for which $\sigma(t) = v$, where $v$ is a vertex label. Edge tag labels, property tag labels, and so on can be defined analogously. A vertex tag, edge tag, etc. is an element $e$ such that $\lambda(e)$ is a tag label of the appropriate kind. For example, consider tag labels Completed, Updated, and Cancelled which are to be applied to Trip vertices:
This graph expresses the fact that two of the trips are Completed, one of which is also Updated. The tag label Cancelled is not used in this graph, yet it is part of the schema. As the tags are themselves elements, an application could extend this schema by adding property labels that qualify the tagging relationships, such as asOfTime or comments.
Although the reader might expect hyperelements, i.e. generalized relationships, to be complicated, they are in fact the simplest kind of label in this taxonomy: literally every possible label is a hyperelement label. Nonetheless, it is sometimes useful to describe as “hyperedges” or “hyperelements” those labels that do not belong to any more constrained class in the taxonomy. In the following graph, for example, we directly translate the Thrift example from Figure 3 by treating a Trip element not as a vertex, nor as a binary edge, but as a hyperelement; every Trip connects two User vertices with two PlaceEvent elements. The primary key of the Trip, as named in the Thrift IDL, is implicit. We will illustrate the schema using the RDF example graph from Figure 1, also adding a timestamp field:
Note that the User fields are “required” in Trip, whereas PlaceEvent fields are “optional”; per the example, the trip has a pickup event, but no dropoff event. This optionality is encapsulated in the type $1 + \mathrm{PlaceEvent}$; values of this type are either the trivial value $()$, i.e. the unit value “on the left”, or a value $e$ “on the right”, where $e$ is a PlaceEvent element. Note also that even though PlaceEvent elements have the shape of (generalized) vertex properties, we choose not to think of them as such, because PlaceEvent does not have the intuitive semantics of a property label.
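The optional-field pattern just described — a sum of the unit type with the payload type — can be sketched as follows; the tags and helper names are our own illustrative choices, not APG vocabulary:

```python
# Optionality as the sum type 1 + PlaceEvent: a value is either the unit value
# "on the left" (absent) or a PlaceEvent "on the right" (present).
ABSENT = ("inl", ())

def present(place_event):
    return ("inr", place_event)

def fold_option(value, if_absent, if_present):
    """Case analysis on a value of type 1 + PlaceEvent."""
    tag, payload = value
    return if_absent() if tag == "inl" else if_present(payload)

# A trip with a pickup event but no dropoff event, as in the example above.
trip = {"pickup": present({"lat": 37.78, "lng": -122.42}),
        "dropoff": ABSENT}

dropoff_desc = fold_option(trip["dropoff"],
                           lambda: "no dropoff event",
                           lambda ev: "dropoff at %s" % (ev,))
```

Case analysis is total: every consumer of an optional field must say what to do in both the absent and present cases.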
The above is a realistic, although simplified, enterprise schema; no clear distinction has been made between vertices, edges, and properties, as in this case such a distinction is not useful. Nonetheless, we can treat the graph as a “property graph” insofar as we can readily compute shortest paths, connected components, PageRank, etc., in addition to many other operations we typically perform on more conventional graph datasets.
In this paper, we have provided a sound mathematical basis for a family of data models we call algebraic property graphs, representing a bridge between heavily used graph and non-graph data models, broadening the scope of graph computing and lowering the barrier to building enterprise knowledge graphs at scale. Among many possible ways of standardizing the popular notion of a property graph, we believe the use of algebraic data types is especially promising due to their ubiquity and conceptual simplicity. The incorporation of this approach into the design of TinkerPop 4 is currently in progress. In addition, the details of the relationship between algebraic property graphs and CQL / algebraic databases [wadt] are a promising line of future work. One particularly interesting application would be to bring APG together with a relational streaming framework such as [akidau2015dataflow], extending the property graph idiom to complex event processing.
We would particularly like to thank Jan Hidders for contributing a set-based formalism which aligns APG with the emerging property graph standard, and we expect to revisit this model when the standard is published. Jeffrey Brown, Marko A. Rodriguez, Gregory Sharp, and Henry Story provided valuable feedback on this paper, while many other individuals in the graph database community, at Uber, and at Conexus AI have motivated and influenced our design. David Spivak and Gregory Malecha gave valuable guidance about category theory and Coq, respectively.
Appendix A Appendix: Category Theory
In this section, we review standard definitions and results from category theory [Awodey:2010:CT:2060081, mac2013categories, fong2019invitation], which serves as an interlingua, or meta-mathematics, between various mathematical fields. For example, using category theory we are able to transport theorems from algebra to topology. As such, it is a natural choice for describing families of related data models, and we are far from alone in proposing its use in this manner (see e.g., [Alagic:2001:MTG:648294.754671] and [Fleming:2003:DC:770018.770021]).
A category is an axiomatically-defined algebraic structure similar to a group, ring, or monoid. It consists of:
a set, $\mathrm{Ob}(\mathcal{C})$, the members of which we call objects, and
for every two objects $X, Y \in \mathrm{Ob}(\mathcal{C})$, a set $\mathcal{C}(X, Y)$, the members of which we call morphisms (or arrows) from $X$ to $Y$, and
for every three objects $X, Y, Z$, a function $\circ_{X,Y,Z} : \mathcal{C}(Y, Z) \times \mathcal{C}(X, Y) \to \mathcal{C}(X, Z)$, which we call composition, and
for every object $X$, an arrow $\mathrm{id}_X \in \mathcal{C}(X, X)$, which we call the identity for $X$.
We may write $g \circ f$ instead of $\circ_{X,Y,Z}(g, f)$, and drop object subscripts on $\circ$ and $\mathrm{id}$, when doing so does not create ambiguity. The data above must obey axioms stating that $\circ$ is associative and $\mathrm{id}$ is its unit:

$h \circ (g \circ f) = (h \circ g) \circ f \qquad \mathrm{id}_Y \circ f = f = f \circ \mathrm{id}_X$
Two morphisms $f : X \to Y$ and $g : Y \to X$ such that $g \circ f = \mathrm{id}_X$ and $f \circ g = \mathrm{id}_Y$ are said to form an isomorphism. An example category with three objects and six morphisms is shown below.
Given a graph $G$, the so-called free category generated by $G$ has for objects the vertices of $G$, and for morphisms the possibly 0-length paths in $G$. A graph inducing the category example from above is:
As another example, the graph $G$ with two vertices $a, b$ and two edges $f : a \to b$ and $g : b \to a$ gives rise to a category with two objects, and infinitely many morphisms in the guise of all paths through $G$, namely $\mathrm{id}_a$, $\mathrm{id}_b$, $f$, $g$, $g \circ f$, $f \circ g$, etc.
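The free-category construction can be made concrete by enumerating paths: identities are 0-length paths and composition is path concatenation. A small sketch, bounding the path length since the free category here has infinitely many morphisms:

```python
# A sketch enumerating morphisms of the free category on a small graph.
# Morphisms are paths (tuples of edge names); the empty path is the identity.
def paths_up_to(edges, start, max_len):
    """All paths from `start`, as (edge-name tuple, endpoint) pairs,
    of length at most max_len."""
    results = [((), start)]          # the identity path at `start`
    frontier = [((), start)]
    for _ in range(max_len):
        nxt = []
        for path, node in frontier:
            for name, (src, dst) in edges.items():
                if src == node:
                    nxt.append((path + (name,), dst))
        results.extend(nxt)
        frontier = nxt
    return results

# Two vertices a, b and two edges f : a -> b, g : b -> a, as in the example.
edges = {"f": ("a", "b"), "g": ("b", "a")}
```

From `a`, the paths of length at most 2 are the identity, `f`, and the composite `g . f` (written here as the tuple `("f", "g")` in traversal order).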
A functor $F$ between categories $\mathcal{C}$ and $\mathcal{D}$ consists of:
a function $F : \mathrm{Ob}(\mathcal{C}) \to \mathrm{Ob}(\mathcal{D})$, and
for every $X, Y \in \mathrm{Ob}(\mathcal{C})$, a function $F_{X,Y} : \mathcal{C}(X, Y) \to \mathcal{D}(F(X), F(Y))$, where we may omit object subscripts when they can be inferred from context, such that $F(\mathrm{id}_X) = \mathrm{id}_{F(X)}$ and $F(g \circ f) = F(g) \circ F(f)$.
The image of the running example category above under a functor $F$ is:
A natural transformation $\alpha$ between functors $F, G : \mathcal{C} \to \mathcal{D}$ consists of a set of morphisms $\alpha_X : F(X) \to G(X)$, indexed by the objects $X$ of $\mathcal{C}$, called the components of $\alpha$, such that for every $f : X \to Y$ in $\mathcal{C}$, $\alpha_Y \circ F(f) = G(f) \circ \alpha_X$. This set of equations may be conveniently rendered as a commutative diagram:
The diagram above consists of a square (the four morphisms) indicating that all paths that start at the same node and end at the same node are to commute (be equal) in $\mathcal{D}$; in this case, there are two such paths. The upper-left two morphisms of a square are called a span, and the bottom-right two morphisms are called a co-span.
A natural transformation is called a natural isomorphism when all of its components are isomorphisms.
A.1 Universal Constructions
A terminal object $\top$ (read “top”) in a category is an object such that for every object $X$, there is a unique morphism $X \to \top$. Dually, an initial object $\bot$ (read “bottom”) is an object such that for every object $X$, there is a unique morphism $\bot \to X$. Like all constructions described by so-called universal properties, terminal and initial objects are unique up to unique isomorphism, provided they exist. Such objects may be familiar to users of functional programming languages; for example, () and Void are the terminal and initial objects, respectively, of the category of idealized Haskell programs.
For objects $X$ and $Y$ in a category, a product of $X$ and $Y$ is an object $X \times Y$ together with morphisms $\pi_1 : X \times Y \to X$ and $\pi_2 : X \times Y \to Y$ called projections, such that for all morphisms $f : Z \to X$ and $g : Z \to Y$ there is a unique morphism $\langle f, g \rangle : Z \to X \times Y$ called the pairing of $f$ and $g$, making the diagram below and on the left commute.
Dually, a co-product of $X$ and $Y$ is an object $X + Y$ together with morphisms $\iota_1 : X \to X + Y$ and $\iota_2 : Y \to X + Y$ called injections, such that for all morphisms $f : X \to Z$ and $g : Y \to Z$, there is a unique morphism $[f, g] : X + Y \to Z$ called the case analysis of $f$ and $g$, making the diagram below and on the right commute.
Given $f : X \to Z$ and $g : Y \to W$, we define:

$f \times g := \langle f \circ \pi_1, g \circ \pi_2 \rangle : X \times Y \to Z \times W \qquad f + g := [\iota_1 \circ f, \iota_2 \circ g] : X + Y \to Z + W$
A category that has (co-)products is called (co-)cartesian, and a functor that preserves (co-)products (i.e., if $P$ is a (co-)product of $X$ and $Y$, then $F(P)$ is a (co-)product of $F(X)$ and $F(Y)$) is also called (co-)cartesian. Because terminal and initial objects are identity objects for products and co-products, they serve as the degenerate, 0-ary versions of products and co-products and are included when we require that a category “have products/co-products”.
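In Set (and in most programming languages) the pairing and case-analysis morphisms defined above are directly computable. A sketch, with our own example functions standing in for $f$ and $g$:

```python
# A sketch of the universal properties in Set: pairing <f,g> into a product
# and case analysis [f,g] out of a co-product.
def pairing(f, g):
    """<f,g> : Z -> X x Y, the unique map commuting with both projections."""
    return lambda z: (f(z), g(z))

def case_analysis(f, g):
    """[f,g] : X + Y -> Z, acting by f on the left summand, g on the right."""
    return lambda v: f(v[1]) if v[0] == "inl" else g(v[1])

h = pairing(lambda n: n + 1, lambda n: n * 2)   # Z = int, X = Y = int
k = case_analysis(len, abs)                     # X = str, Y = int, Z = int
```

Here `h(3)` yields the pair `(4, 6)`, while `k` dispatches on the injection tag: strings go to `len` and integers to `abs`.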
To pull back a co-span $X \xrightarrow{f} Z \xleftarrow{g} Y$ as below is to construct an object $P$, sometimes written $X \times_Z Y$, and morphisms $p_1 : P \to X$ and $p_2 : P \to Y$ such that $f \circ p_1 = g \circ p_2$, and moreover, such that for any other such $q_1 : Q \to X$ and $q_2 : Q \to Y$ where $f \circ q_1 = g \circ q_2$, there is a unique arrow $u : Q \to P$ making $p_1 \circ u = q_1$ and $p_2 \circ u = q_2$. To push out a span $X \xleftarrow{f} Z \xrightarrow{g} Y$ as below is to construct an object $P$, sometimes written $X +_Z Y$, and morphisms $i_1 : X \to P$ and $i_2 : Y \to P$ such that $i_1 \circ f = i_2 \circ g$, and moreover, such that for any other such $j_1 : X \to Q$ and $j_2 : Y \to Q$ where $j_1 \circ f = j_2 \circ g$, there is a unique arrow $u : P \to Q$ making $u \circ i_1 = j_1$ and $u \circ i_2 = j_2$.
A pullback along a co-span into a terminal object is a product, and a pushout along a span out of an initial object is a co-product. A category with all pullbacks and a terminal object is said to have all finite limits; dually, having all pushouts and an initial object means having all finite co-limits.
The equalizer of two morphisms $f, g : X \to Y$ is an object $E$ and morphism $e : E \to X$ such that $f \circ e = g \circ e$, and moreover, for any $E'$ and $e' : E' \to X$ such that $f \circ e' = g \circ e'$, there is a unique $u : E' \to E$ such that $e \circ u = e'$. Dually, the co-equalizer of two morphisms $f, g : X \to Y$ is an object $C$ and morphism $c : Y \to C$ such that $c \circ f = c \circ g$, and moreover, for any $C'$ and $c' : Y \to C'$ such that $c' \circ f = c' \circ g$, there is a unique $u : C \to C'$ such that $u \circ c = c'$:
A pullback is equivalent to a product followed by an equalizer, and vice-versa. A pushout is equivalent to a co-product followed by a co-equalizer, and vice versa. Hence we will freely interchange various kinds of limits and co-limits in this paper.
The category of sets and total functions, $\mathbf{Set}$, has for objects all the “small” sets in some set theory, such as Zermelo-Fraenkel set theory, and a morphism from $X$ to $Y$ is a total function represented as a set. The initial object is the empty set and every singleton set is terminal. Products are cartesian products, and co-products are disjoint unions. The equalizer of two arrows $f, g : X \to Y$ is the set $\{x \in X \mid f(x) = g(x)\}$ together with its inclusion function into $X$, and the co-equalizer of $f, g$ is the set of equivalence classes of $Y$ under the equivalence relation induced by $f(x) \sim g(x)$ for every $x \in X$, along with the function taking each element of $Y$ to its equivalence class.
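Both constructions are computable on finite sets, following the description above; the union-find bookkeeping is our own implementation choice:

```python
# A sketch of equalizers and co-equalizers in Set, computed on finite sets.
def equalizer(X, f, g):
    """The subset of X on which f and g agree (inclusion left implicit)."""
    return {x for x in X if f(x) == g(x)}

def coequalizer(X, Y, f, g):
    """Equivalence classes of Y under the relation generated by f(x) ~ g(x)."""
    parent = {y: y for y in Y}          # union-find over Y
    def find(y):
        while parent[y] != y:
            y = parent[y]
        return y
    for x in X:                          # merge f(x) with g(x) for every x
        a, b = find(f(x)), find(g(x))
        if a != b:
            parent[a] = b
    classes = {}
    for y in Y:
        classes.setdefault(find(y), set()).add(y)
    return set(frozenset(c) for c in classes.values())

X = {0, 1, 2, 3}
Y = {0, 1, 2}
f = lambda x: x % 2     # f, g : X -> Y
g = lambda x: x % 3
```

Here $f$ and $g$ agree only on 0 and 1, so the equalizer is $\{0, 1\}$; the generated equivalence relation merges all of $Y$ into a single class.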
The category $\mathbf{Rel}$ has “small” sets for objects and binary relations as morphisms. Products and co-products coincide in $\mathbf{Rel}$ (it is self-dual), and both are given by disjoint union.
Programming languages often form categories, with types as objects, and with programs taking inputs of type $X$ and returning outputs of type $Y$ as morphisms $X \to Y$. Products and co-products are common, but equalizers and co-equalizers are not, perhaps because they correspond to dependent types (since they are objects depending on morphisms).
In the category of rings (with identity), $\mathbf{Ring}$, the product corresponds to the “direct” product of rings, and the co-product to the “free” product of rings. The terminal ring has a single element corresponding to both the additive and multiplicative identity, and the initial ring is the ring of integers. The equalizer in $\mathbf{Ring}$ is just the set-theoretic equalizer (the equalizer of two ring homomorphisms is always a subring). The co-equalizer of two ring homomorphisms $f$ and $g$ from $R$ to $S$ is the quotient of $S$ by the ideal generated by all elements of the form $f(r) - g(r)$ for $r \in R$.
Appendix B Appendix: Extensions of Algebraic Datatypes
In this section we define the free (indexed) bi-cartesian category over a category $\mathcal{C}$, following the presentation of [seely]. A type is a term in the grammar:
A morphism is an equivalence class of terms in the grammar:
under the usual axioms for indexed products and co-products:
We define $1 := \prod_{l \in \emptyset}$ and $0 := \sum_{l \in \emptyset}$. For every pair of distinguished labels $l_1$ and $l_2$, we may define binary products and co-products using $\{l_1, l_2\}$ as the indexing set. The paper [Gaster96apolymorphic] provides a sound and complete label inference system for the above grammar based on Haskell-like qualified types.
To add function types, we can consider the free bi-cartesian closed category on $\mathcal{C}$, which adds exponential objects/types and $\lambda$-terms (or their combinator equivalents, $S$ and $K$) to $\mathcal{C}$ [gabriel]. Note however that even when a category is not closed, it can be presented as a $\lambda$-calculus [fiore], where, instead of writing a variable-free combinator $f : X \to Y$, we write a variable-containing sequent $x : X \vdash f(x) : Y$.
To add (co-)inductive data types such as (co-finite) lists or (co-finite) trees, we can instead consider bi-cartesian categories that admit (co-)initial algebras for polynomial endofunctors, which contain a (co-)“fold” operation that expresses structural (co-)recursion [grust] (note that co-inductive types suffice to represent objects in the sense of OO programming [mitchell]). Another alternative is to consider the free topos, which adds a type of propositions Prop, an equality predicate eq, set-comprehension in the guise of $\lambda$-terms, and a natural numbers object Nat with associated recursor to $\mathcal{C}$. Yet another alternative is to consider adding collection monads representing type constructors such as Bag and List to our type theory, along with associated comprehension syntax (or the monad combinators return, map, bind) for forming nested collections [grust]. Note, however, that such complex objects are not necessary for typical applications of property graphs, which avoid the need for them by reifying relationships as edges. For example, a linked list of vertices can be realized in property graphs using something like a List vertex type together with two edge types, first and rest; this pattern is also familiar from the RDF Schema terms rdf:List, rdf:first, and rdf:rest. Recursive types are not even possible in APG unless meta-edges (Section 4.2) and/or hyperelements (Section 4.5) are supported.
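The reified-list pattern just mentioned can be sketched directly: one cell vertex per list position, a first edge to the value, and a rest edge to the next cell (cf. rdf:first and rdf:rest). The cell-naming scheme and function names here are our own:

```python
# A sketch of the reified-list pattern: a vertex per list cell, with `first`
# edges to values and `rest` edges to the next cell.
def to_cells(values):
    """Shred a Python list into cells plus first/rest edge maps."""
    first_edges, rest_edges = {}, {}
    cells = ["cell%d" % i for i in range(len(values))]
    for i, v in enumerate(values):
        first_edges[cells[i]] = v
        if i + 1 < len(values):
            rest_edges[cells[i]] = cells[i + 1]
    return cells, first_edges, rest_edges

def from_cells(head, first_edges, rest_edges):
    """Reassemble the list by following rest edges from the head cell."""
    out, cell = [], head
    while cell is not None:
        out.append(first_edges[cell])
        cell = rest_edges.get(cell)
    return out

cells, first_edges, rest_edges = to_cells(["a", "b", "c"])
```

Round-tripping from the head cell recovers the original list, which is exactly why no recursive type is needed: the recursion lives in the edge structure, not the type.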
Appendix C Appendix: Algorithms on APGs
We start with preliminaries. Because types and terms are inductively defined, we can change their labels and elements using structural recursion, via an operation that transports a type along a function taking labels to types, and an operation that transports data along a function taking elements to values (in addition to transporting types as just described). We define:
The identity APG morphism is given by identity functions in Set, and APG morphism composition is given by function composition in Set. Initial and terminal objects are similarly inherited from Set:
For the initial APG, we define the label and element sets to be empty, with the unique morphism into any other APG given by the unique pair of functions in Set. For the terminal APG, we define the label and element sets to be singletons (the single label mapping to the unit type), with the unique morphism from any other APG given by the unique pair of functions in Set.
We define the co-product APG as:
with evident injection APG morphisms $i_1$ and $i_2$ and case-analysis morphism $[f, g]$ for every $f$ and $g$.
We define the product APG as:
with evident projection APG morphisms $p_1$ and $p_2$ and pairing morphism $\langle f, g \rangle$ for every $f$ and $g$.
We define the equalizer APG for morphisms $f, g$ as
with the morphism into the source APG easily seen to be an inclusion. Given another APG and morphism equalizing the two given morphisms, we construct the mediating morphism using similar “if-then-inject” logic as above.
Co-equalizers: let $G_1$ and $G_2$ be APGs and $f, g : G_1 \to G_2$ be APG morphisms, and let $\sim$ be the equivalence relation induced by $f$ and $g$. Writing $X/{\sim}$ to indicate the quotient of the set $X$ by $\sim$, and writing the canonical operations of injecting into an equivalence class and finding a representative for an equivalence class, we define the co-equalizer APG as
with the morphism given by
The mediating morphism of the universal property is also trivial, involving representatives of equivalence classes instead.
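The simplest of these constructions, the co-product, amounts to a tagged disjoint union of the two APGs' labels and elements. A sketch, using a deliberately simplified dictionary representation of our own design (types and values are elided):

```python
# A sketch of the co-product of two APGs as a disjoint union of labels and
# elements, tagged with "inl"/"inr" to keep the two sides apart.
def apg_coproduct(g1, g2):
    """Each APG is {'labels': label -> type, 'elements': element -> label}."""
    labels = {("inl", l): t for l, t in g1["labels"].items()}
    labels.update({("inr", l): t for l, t in g2["labels"].items()})
    elements = {("inl", e): ("inl", l) for e, l in g1["elements"].items()}
    elements.update({("inr", e): ("inr", l) for e, l in g2["elements"].items()})
    return {"labels": labels, "elements": elements}

g1 = {"labels": {"User": "unit"}, "elements": {"e1": "User"}}
g2 = {"labels": {"Trip": "unit"}, "elements": {"e1": "Trip"}}  # same id: no clash
g = apg_coproduct(g1, g2)
```

The tagging is what makes the union disjoint: the two elements named `e1` remain distinct in the co-product, and each keeps its (tagged) label.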
Appendix D Appendix: Mappings to External Data Models
In this section, we describe how various “model-ADTs” (http://rredux.com/mm-adt/#model-adts) proposed for Apache TinkerPop 4 can be specified as APG schemas and data integrity constraints thereon. The property-graph model-ADT has already been described above; see Section 4. The others are described in the sections below.
Note that it is not unexpected that APG can represent these model-ADTs, as each model can also be defined as a category of algebraic databases on a constrained schema in the sense of [wadt]. Note also that although an APG schema can define each model-ADT, APG gives no direct guidance about any model-ADT-specific query languages one may wish to use to query a particular model-ADT, and similarly for updates. It should be possible to adapt a variety of existing query languages to APG; SPARQL is currently used for queries over APG schemas at Uber, while an exploration of other languages, such as Gremlin and Cypher, is left for future work.
It is straightforward to place data integrity constraints, such as primary key constraints, on APG data, and to use these constraints to implement other data models, such as the key-value model-ADT. That is, when $\sigma(l) = k \times v$ and no two $l$-elements share the same first component, we are justified in saying that $k$ is a primary key for the label $l$, and we may treat each $l$-element as a key-value pair. For example, a suitable key type is $1 + 1 + \cdots + 1$ for some large but finite number of summands (or Nat); another suitable choice uses the element itself as a primary key for itself, a similar technique to using all of a table’s columns, together, as a primary key.
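The key-value reading can be sketched directly: collect the pair-valued elements of a label into a dictionary, rejecting duplicate keys, which is exactly the primary key constraint. Names here are our own:

```python
# A sketch of the key-value view: when every element of a label carries a
# distinct key component, the elements form a key-value store.
def as_key_value_store(element_values):
    """element_values: (key, value) pairs, one per element of a single label."""
    store = {}
    for k, v in element_values:
        if k in store:
            raise ValueError("primary key violated: %r" % (k,))
        store[k] = v
    return store

store = as_key_value_store([(1, "trip to airport"), (2, "trip home")])
```

A duplicate key raises, witnessing the violated constraint rather than silently overwriting.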
The relational model-ADT is represented in APG by requiring each APG label to map to a product type. Our taxonomy (Section 4) can be understood relationally in the same way. That is, each label $l$ of an APG has an associated algebraic data type $\sigma(l)$. If we suppose, for example, that $\sigma(l) = \mathrm{String} \times \mathrm{Nat}$ for a given label $l$, then we can think of $l$ as defining a table with two columns, the first holding strings and the second holding natural numbers, and we can store all elements $e$ with $\lambda(e) = l$ in that table. In this way, many APG schemas can be completely understood and implemented using the relational model. However, our proposal goes beyond the traditional relational model by allowing domain values, i.e. the objects of relational tuples, to be structured as nested tuples and variants, as opposed to atomic in the sense of Codd. For example, consider a value which is a tuple of two objects, the first of which is itself a tuple – a latitude/longitude pair – and the second of which is a variant value representing a choice of one alternative over another, such as a timestamp in seconds as opposed to a timestamp in milliseconds. We may further extend this graphs-as-relations analogy, allowing relational notions of data integrity constraint such as primary keys (see Section D.1) to generalize to APGs. At the limit of this analogy, we arrive at the algebraic databases of [wadt].
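The nested-tuple values just described can be flattened back into atomic columns when a traditional relational target is required. A sketch, using our own encoding in which variants are tagged tuples and the variant tag becomes a column of its own:

```python
# A sketch flattening nested tuples and variants into an atomic column list,
# Codd-style. Variants are encoded as ("inl", x) / ("inr", x) tagged tuples.
def flatten(value):
    if isinstance(value, tuple) and value and value[0] in ("inl", "inr"):
        return [value[0]] + flatten(value[1])   # keep which alternative was chosen
    if isinstance(value, tuple):
        out = []
        for v in value:
            out.extend(flatten(v))              # recurse into nested tuples
        return out
    return [value]                              # atomic value: one column

# A lat/lng pair plus a seconds-variant timestamp, as in the example above.
row = ((37.78, -122.42), ("inl", 1561080000))
```

The nested row flattens to four columns, one of which records the chosen variant.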
To represent a triple store as an APG schema requires a single label mapping to a type that is a three-fold product. Alternatively, because triple stores are directed multi-graphs, we can represent them directly as categories with trivial composition relations, the so-called free categories (see Appendix A). In the reverse direction, representing an APG in a triple store is straightforward: let $\mathcal{C}$ be a category and $F : \mathcal{C} \to \mathbf{Set}$ a functor. The category $\int F$, called the category of elements or Grothendieck construction for $F$, has for objects all pairs $(X, x)$ where $X \in \mathrm{Ob}(\mathcal{C})$ and $x \in F(X)$. The morphisms from $(X, x)$ to $(Y, y)$ are all the morphisms $f : X \to Y$ in $\mathcal{C}$ such that $F(f)(x) = y$. Note that there is an obvious canonical functor $\int F \to \mathcal{C}$. Each such category is generated by the graphs in Section 4.1 (along with their implicit associated commutativity constraints, omitted in those diagrams), and can be stored (in e.g. RDF) using the usual graphs-as-triples encoding.
A slightly more idiomatic mapping is used for the translation of APG schemas to RDF at Uber; for example, sum types receive special treatment as OWL classes with either a subclass or an OWL named individual per case, while product types are realized with one OWL object property or datatype property per field. As was mentioned above, RDF-PG mappings have a relatively long history, and many approaches have been explored, including the RDF* [hartig2014reconciliation] mapping which is now the subject of a standardization effort.
D.4 Nested Data
To a first approximation, representing JSON and other nested datasets such as XML reduces to representing nested relational datasets as flat relational datasets, using the traditional technique of shredding nested data into parent/child foreign keys [Bussche1999SimulationOT]. An example of this shredding process is shown in Figure 4.
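The shredding technique can be sketched as a recursive walk that emits one row per node, each carrying a foreign key to its parent; the row shape and id scheme below are our own illustrative choices:

```python
# A sketch of shredding nested data into parent/child foreign-key rows:
# each row is (id, parent_id, key, atomic_value), with value None for
# interior nodes whose children carry the data.
def shred(node, rows=None, parent=None, counter=None):
    if rows is None:
        rows, counter = [], [0]
    for key, val in node.items():
        counter[0] += 1
        node_id = counter[0]
        if isinstance(val, dict):
            rows.append((node_id, parent, key, None))   # interior node
            shred(val, rows, node_id, counter)          # children point back here
        else:
            rows.append((node_id, parent, key, val))    # leaf: atomic value
    return rows

doc = {"user": {"name": "alice", "home": {"lat": 37.78}}}
rows = shred(doc)
```

The nesting survives entirely in the parent_id column, so the original document can be reassembled by joining each row to its parent.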