DataJoint: A Simpler Relational Data Model

07/29/2018 ∙ by Dimitri Yatsenko, et al. ∙ Baylor College of Medicine 0

The relational data model offers unrivaled rigor and precision in defining data structure and querying complex data. Yet the use of relational databases in scientific data pipelines is limited due to their perceived unwieldiness. We propose a simplified and conceptually refined relational data model named DataJoint. The model includes a language for schema definition, a language for data queries, and diagramming notation for visualizing entities and relationships among them. The model adheres to the principle of entity normalization, which requires that all data -- both stored and derived -- must be represented by well-formed entity sets. DataJoint's data query language is an algebra on entity sets with five operators that provide matching capabilities to those of other relational query languages with greater clarity due to entity normalization. Practical implementations of DataJoint have been adopted in neuroscience labs for fluent interaction with scientific data pipelines.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 25

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 The Relational Data Model

1.1 Core concepts and terminology

The relational data model [Codd, 1970] provides the most rigorous approach to structured data storage and the most precise approach to querying data. The model is defined by the principles of data representation, domain constraints, uniqueness constraints, referential constraints, and declarative queries as summarised in Table 1.

=6pt

|X| Data representation. Data are represented and manipulated in the form of relations. A relation is a set (i.e. an unordered collection) of tuples of values for each of the respective named attributes of the relation. Base relations represent stored data while derived relations are formed from base relations through query expressions. A collection of base relations with their attributes, domain constraints, uniqueness constraints, and referential constraints is called a schema.


Domain constraints. Attribute values are drawn from corresponding attribute domains, i.e. predefined sets of values. Attribute domains may not include relations, keeping data model flat, free of nested structures.


Uniqueness constraints. Tuples within relations are addressed by values of their attributes. To identify and relate data elements, uniqueness constraints are imposed on subsets of attributes, then referred to as keys. One key in a relation is designated as the primary key used for referencing its elements.


Referential constraints. Associations among data are established by means of referential constraints with the help of foreign keys. A referential constraint on relation A referencing relation B allows only those tuples in A whose foreign key attributes match the key attributes of a tuple in B.


Declarative queries. Data queries are formulated through declarative, as opposed to imperative, specifications of sought results. This means that query expressions convey the logic for the result rather than the procedure for obtaining it. Formal languages for query expressions include relational algebra, relational calculus, and SQL SELECT statements.

Table 1: Core principles of the relational data model.

Popular implementations of the relational data model rely on the Structured Query Language (SQL). SQL comprises distinct sublanguages for schema definition, data manipulation, and data queries. SQL thoroughly dominates in the space of relational databases and is often conflated with the relational data model in casual discourse.

Various terminologies are used to describe related concepts from the relational data model (Table 2). Relations are often visualized as tables with attributes corresponding to columns and tuples corresponding to rows. In particular, SQL uses the terms table, column, and row.

=6pt

|X[1,c,p]| X[1,c]| X[1,c]| X[1,c]| Relational & ERM & SQL & DataJoint
& entity set & &
relation & relationship set & table & entity set
tuple & entity & row & entity
domain & value set & data type & data type
attribute & attribute & column or field & attribute
attribute value & attribute value & field value & attribute value
primary key & primary key & primary key & primary key
foreign key & foreign key & foreign key & foreign key or dependency
schema & schema & schema & schema
relational expression

or derived relation & data query & SELECT statement & query expression

Table 2: Corresponding terms used in variants of relational models.

1.2 Conceptual clarification

The relational model is abstract and semantically unconstrained, providing few guidelines for translating problems into database schemas or for forming valid queries. Learning to design databases and to compose valid queries in practical problems requires some form of conceptual clarification: a system of conventions and techniques for intuitive mapping from real-world rules to relational concepts.

A set of formal rules known as normal forms have been devised to test whether a particular schema meets basic quality criteria that preclude redundancies in data storage and anomalies in data manipulations [Kent, 1983]. Relying on the mathematical concept of functional dependencies among the attributes, normal forms remain abstract without a definitive approach to conceptual modeling of real-world problems. When a schema does not meet the normal forms, it is considered unnormalized; whereas redesigning schemas to meet these tests is called normalization. SQL does not provide mechanisms for enforcing or diagnosing normalization in its data definition language. Mastery of normalized schema design requires considerable training.

For data queries, also, the relational data model provides few constraints as to what constitutes a valid and meaningful query. It allows unlimited freedom to compare, match, and combine attributes of relations regardless of their semantic compatibility.

New practitioners entering the field of database programming with their diverse backgrounds and mindsets may conceive of database tables as mere spreadsheets with arbitrary semantics of rows and columns. Extended experience with more sophisticated designs helps develop semantic mappings between real-world information motifs and database design patterns.

The lack of semantic constraint in the relational data model leaves great freedom to experienced developers with honed conceptual skills. But it also allows for a great diversity of incompatible approaches to schema design and data queries and lengthens the path from mediocrity to proficiency.

1.3 The Entity-Relationship Model

A successful effort to clarify the relational data model was laid out by the Entity-Relationship Model (ERM) [Chen, 1976]. The ERM adopts the view that the world can be intuitively modeled to consist of entities and relationships between them them; it prescribes an approach to map this view into relational concepts.

The ERM provides

  1. a general approach for problem analysis and conceptual modeling for schema design

  2. diagramming notation for schema design

  3. guidelines for composing meaningful queries

The central concept in the ERM is that of an entity set: an unordered collection of identifiable items (entities) in the modeled world that share the same set of attributes, are distinguished from each other by the same primary key, and can participate in the same types of relationships with other entity sets. In the ERM, all base relations are either entity sets or relationship sets; these terms effectively subsume the term relation (See Table 2).

A relationship set is a collection of associations linking entities from two or more entity sets. These associations take the form of referential constraints (foreign keys) between relationships sets and entity sets. In addition to foreign keys, entity sets may have secondary attributes describing the relationship.

Although the ERM is mostly known for its approach to schema design, it also prescribes methods to form valid data queries: Queries rely on foreign keys to match entities through the relationships in which they participate.

ERM diagramming notation depicts entity sets and relationship sets, turning the ERM into an effective tool for conceptual modeling and communication between database designers, customers, and management. For example, Figure 1A depicts a simple entity-relationship diagram for a university student database with two entity sets, Student and Department and the relationship majors in between them.

Figure 1: A: A database schema modeling university departments, students, and their major departments in ERM diagramming notation. B: An equivalent DataJoint schema diagram.

The ERM was proposed as a complete data model in its own right but its conceptual refinements were never translated into a practical programming language. Today, the ERM is best known as a diagramming technique for conceptual modeling. Formal courses in database design prescribe a two-phased approach: conceptual modeling using a diagramming tool such as the ERM followed by its translation (logical design) into a relational database schema and SQL [Elmasri and Navathe, 2015, Coronel and Morris, 2016]. Expert database designers internalize ERM precepts and commonly distinguish “entity tables” from “relationship tables” even though SQL makes no such distinctions.

A schema definition language tailored to the ERM will provide greater clarity, making it easier to learn and to use than the more general relational model and SQL. ERM designs naturally produce database schemas that comply with normal forms; the problem of database normalization is translated into a more relatable problem of distinguishing well-defined entities in the modeled world.

For example, in the ERM, the foreign key always references the primary key of the referenced entity set. Therefore, an ERM-specific schema definition language can simplify the use of foreign keys. Consider the following section of SQL table definition code defining a foreign key to the Course table whose primary key comprises attributes (dept, course):

dept char(6) NOT NULL COMMENT
  "abbreviated department name, e.g. BIOL",
course int unsigned NOT NULL COMMENT
  "course number, e.g. 1010",
FOREIGN KEY (dept, course)
  REFERENCES Course(dept, course)

In an ERM-tailored language, the equivalent foreign key definition can be simply

-> Course

This definition clearly and simply communicates the intent of the foreign key: reference elements of the entity set Course, without extraneous detail. The definition combines two jobs in a single step: add the referencing attributes to the table definition and create the foreign key constraint. The resulting definition is not only more succinct but is also less error-prone since changes in the definition of the referenced table will correctly propagate into the referencing table.

This syntax simplification would serve well in designs that follow ERM conventions. In more esoteric uses of the relational model, foreign keys may reference non-primary key attributes and may require special syntax.

The ERM’s conceptual clarity comes at the expense of some loss of representational power, skirting around some of the more esoteric capabilities and complexities of the broader relational model. For example, ERM forgoes representations of multiple functional dependencies between overlapping sets of attributes for which the higher normal forms become applicable. Giving up some representational power for increased conceptual clarity is a fair tradeoff in practical database design.

1.4 The DataJoint model

We introduce DataJoint as a conceptual refinement of the relational data model offering a more expressive and rigorous framework for database programming. The DataJoint model facilitates clear conceptual modeling, efficient schema design, and precise and flexible data queries.

The model has emerged over a decade of continuous development of complex data pipelines for neuroscience experiments [Yatsenko et al., 2015]. DataJoint has allowed researchers with no prior knowledge of databases to collaborate effectively on common data pipelines sustaining data integrity and supporting flexible access. DataJoint is currently implemented as client libraries in MATLAB and Python. These libraries work by transpiling DataJoint queries into SQL before passing them on to conventional relational database systems serving as the backend. In this paper, we crystallize the underlying data model to guide future development. While we focus on concepts and principles, we present the model as a complete scripting language separate from any host programming language.

DataJoint comprises

  • a schema definition language (Sec. 2 and 6)

  • a data manipulation language (Sec. 4)

  • a data query language (Sec. 5)

  • diagramming notation for visualizing relationships between modeled entities (Sec. 3).

In DataJoint, all data exist in the form of relations representing entity sets. As with any set, the order of elements in an entity set is never significant; elements cannot be addressed or identified by their position in the entity set.

Since entity sets are relations, which are often visualized as tables, the terms relation, entity set, and table can be used interchangeably in DataJoint (See Table 2). Individual elements of an entity set may be called entities, entity instances, tuples, or rows; for consistency, we will continue to refer to them as simply elements of their respective entity sets.

The attributes of an entity set may also be called columns or fields. In DataJoint, attributes in relations are always identified by their names and their order is never significant, from the logical point of view, although it may affect the performance of some operations.

1.5 Entity normalization

Entity normalization is a conceptual refinement of the relational data model and the central principle of the DataJoint model. Entity normalization is the requirement that all data must exist in the form of relations that meet the criteria of well-formed entity sets.

These criteria are

  1. All elements of an entity set belong to the same well-defined and readily identified entity type from the model world.

  2. All attributes of an entity set are applicable directly to each of its elements, although some attribute values may be missing (set to null).

  3. All elements of an entity set must be distinguishable form each other by the same primary key.

  4. Primary key attribute values cannot be missing, i.e. set to null.

  5. All elements of an entity set participate in the same types of relationships with other entity sets.

The term entity normalization refers to the procedure of refactoring a schema design that does not meet the above criteria into one that does. In some cases, this may require breaking up some entity sets into multiple entity sets, causing some entities to be represented across multiple entity sets. In other cases, this may require converting attributes into their own entity sets.

Entity normalization entails compliance with the Boyce-Codd normal form while lacking the representational power for the applicability of more complex normal forms [Kent, 1983]. Thus adherence to entity normalization prevents redundancies and data manipulation anomalies that originally motivated the formulation of the classical relational normal forms.

Adherence to entity normalization is the common thread unifying DataJoint’s data definition, data manipulation, and data queries.

2 Schema definition

2.1 Entity set definition

DataJoint differentiates between base entity sets and derived entity sets: the former represent stored data and are persistent whereas the latter arise from query expressions. DataJoint’s schema definition language allows defining base entity sets and dependencies between them. A schema is a collection of base entity sets and their dependencies (See Tables 1 and 2). Base entity sets represent the stored data and are identified by unique names within their schemas.

Let’s define a simple university database for use as a running example. The example will include student information, university courses, and enrollment information. Variations of such university databases have been used in many papers and textbooks on relational design theory and we follow suit here.

Listing 1 begins the university database by declaring entity sets Student,  Department, and StudentMajor as diagrammed in Figure 1B.

::Student
student_id : int unsigned   # university ID
---
first_name      : varchar(40)
last_name       : varchar(40)
sex             : enum(’F’, ’M’, ’U’)
date_of_birth   : date
home_address    : varchar(200) # street address
home_city       : varchar(30)
home_state      : char(2)  # two-letter abbreviation
home_zipcode    : char(10)
home_phone      : varchar(14)
::Department
dept : char(6)   # abbreviated department name, e.g. BIOL
---
dept_name    : varchar(200)  # full department name
dept_address : varchar(200)  # mailing address
dept_phone   : varchar(14)
::StudentMajor
-> Student
---
-> Department
declare_date :  date  # when student declared her major
Listing 1: University database schema definition (Part 1).

Each entity set begins with the line specifying the entity name as

::EntityName

By convention, entity set names describe an individual element of the set in the singular form. For example, entity set Student represents all students in our university.

2.1.1 Attributes and their datatypes

Each line of the entity definition defines an entity attribute in the form

attr_name : datatype   # comment

The comment is optional.

Optionally, the attribute definition may specify a default value v for the attribute:

attr_name = v : datatype  # comment

The default value of null indicates that the attribute value may be altogether omitted in any element of the entity set.

A variety of datatypes may be used. In our examples, we will use the following datatypes familiar from SQL variants.

  • int – 32-bit signed integer

  • int unsigned – 32-bit unsigned integer

  • decimal(n,m) – decimal number with n total digits and m fractional digits

  • char(n), varchar(n) – a string of up to n characters

  • date – calendar date

  • year – calendar year

  • enum(’one’, ’two’ , ’three’) – element from the enumerated set of values

  • double – a double-precision floating point number

::Course
-> Department
course  : int unsigned   # course number, e.g. 1010
---
course_name :  varchar(200)  # e.g. "Cell Biology"
credits     :  decimal(3,1)  # number of credits earned by completing the course
::Term
term_year : year
term      : enum(’Spring’, ’Summer’, ’Fall’)
::Section
-> Course
-> Term
section : char(1)
---
room  :  varchar(12)   # building and room code
::CurrentTerm
---
-> Term
::Enroll
-> Section
-> Student
::LetterGrade
grade : char(2)
---
points : decimal(3,2)
::Grade
-> Enroll
---
-> LetterGrade
Listing 2: University database schema definition (Part 2).
Figure 2: The schema diagram of the university database.

2.2 Primary key

2.2.1 Role in entity integrity

Entity integrity is the guarantee made by the data management process that entities from the real world are reliably and uniquely represented in the database system. Entity integrity breaks down when the process permits duplicate representations or when a previously created representation cannot be reliably recalled for an external entity.

Primary keys are the principal tool for conveying and enforcing entity integrity within the database system. Every entity set must have a primary key: a set of attributes whose values, jointly, distinguish elements of the entity set from one another. No two elements of the same entity set can share the same combination of values of the primary attributes.

In some scenarios, the database system can ensure entity integrity without any measures taken in the external world and the database system can generate primary key values internally and automatically. One such scenario could be a scientific experiment recording transient entities whose data are entered all at once, never needing to be linked back to the external world again. Some entities may exist primarily within the database system, e.g. event logs or application connection sessions; their primary key values may not need to be exposed to the external world either.

For repeated interactions with persistent entities in the external world, the database system cannot accomplish entity integrity by itself; unique identification of real-world entities must be established by the overall data management process. This requires measures to assign and maintain a unique and persistent identifying attribute linked to the external entities. For example, the US Social Security Administration places high value on entity integrity for entities of type “American Worker” and goes to great lengths to ensure that, within its overall data management process, the representation of each worker is reliably associated with the same worker and that every worker is associated with the same representation, every time. The application process for a social security card includes stringent measures to verify one’s identity and the use of another person’s social security number is prosecuted as a serious offense. Many other systems piggyback on the entity integrity effort performed by the Federal Government: social security numbers have gained widespread use as personal identifiers in financial and legal dealings, for example, unrelated to taxes and social security.

In summary, entity integrity is the ability to provide a ready and explicit answer for any entity set to the following question:

How does the overall data management process ensure a robust unique mapping between the primary key values of the entity set and the real entities that it claims to represent?

2.2.2 Primary key declaration

In each table declaration (e.g. Listings 1 and 2), the divider --- separates the primary key attributes above from other attributes below. Each entity set must have a primary key although it may comprise multiple attributes. For brevity, we refer to the attributes in the primary key as primary attributes and all others as secondary attributes.

The primary key may comprise a single attribute such as student_id in Student (Listing 1), multiple attributes such as attributes term_year and term in entity set Term, or no attributes at all as in CurrentTerm (Listing 2). No two elements in Student can share the same value of student_id and any element can be uniquely identified by student_id alone. A pair of elements of Term can be in the same term_year or in the same term but cannot be together in both.

2.2.3 Phantom attribute

In explaining the DataJoint data model, we find helpful to introduce the concept of the phantom attribute that is always present in the primary key of all entity sets. The domain of comprises one element, say, "1", which is also the default value for that is used for inserts. The phantom attribute is contrived and carries no information but it helps reason through some of the less intuitive aspects of dependencies and query operators.

2.2.4 Singleton entity sets

Following the terminology from set theory, the number of elements in an entity set is known as its cardinality and the maximum cardinality of an entity set as its capacity. The capacity of an entity set is the cardinality of the domain of its primary key, which is the product of the cardinalities of the domains of all the primary attributes.

A singleton entity set is an entity set with the capacity of 1. This occurs when the domain of the primary key contains exactly one element. A particular class of singleton entity sets are those with zero attributes in the primary key, with only remaining.

Singleton entity sets contain singular facts. For example, CurrentTerm in Listing 2 is a singleton base entity set whose only element, if present, represents the academic term currently in session.

2.3 Dependencies

2.3.1 Referential integrity

Related data may be distributed across multiple entity sets, creating the potential for dissociations or inconsistencies.

Referential integrity is the guarantee made by the data management process that related data across the database remain present, correctly associated, and mutually consistent.

Referential integrity is predicated upon entity integrity; it relies on the use of primary keys for referencing elements and for relating elements between entity sets.

2.3.2 Declaration of dependencies

In DataJoint, referential integrity is supported through the concept of a dependency. Entity set Dependent is said to have a dependency on entity set Ref when every element of Dependent is defined with respect to a specific element of Ref. A dependency is declared as a separate line of an entity set definition using the right arrow -> notation:

::Dependent
...
-> Ref
...

An entity set may declare multiple dependencies. For example, entity set StudentMajor in Listing 1 declares two dependencies: one on Student from its primary key and another on Department as a secondary attribute.

In this context, Dependent is said to be the dependent entity set and Ref the referenced entity set; every element of Dependent is dependent on its referenced element in Ref. An entity set may have dependencies on multiple entity sets.

2.3.3 Effects of dependencies

Referential integrity prohibits the creation of a dependent element before creating its referenced element. Conversely, a referenced element cannot be deleted before all its dependent elements are deleted.

Declaring a dependency in base entity set Dependent on entity set Ref performs the following actions:

  1. Add the primary key attributes of Ref to the definition of Dependent with the same names and datatypes, skipping any that have already been added by other dependencies. Then every element of Dependent will contain the primary key value of its referenced element in Ref, thereby identifying the referenced element.

  2. Create the referential constraint that precludes elements in Dependent without the corresponding referenced elements in Ref. The constraint is enforced by

    1. prohibiting inserting into Dependent of elements unless they reference an existing elements in Ref,

    2. prohibiting deletes of elements from Ref that have dependent elements in Dependent or propagating such deletes to the dependent elements also,

    3. adding caution to updates any attributes values in referenced elements of Ref, considering that dependent elements in Dependent may have been created with consideration for the values of attributes in the referenced elements (See Section 2.3.4).

  3. Create an index to accelerate searches in Dependent given the primary key values from Ref — if no appropriate index already exists.

The primary key attributes of Ref within Dependent are known as the foreign key. By construction, foreign key attributes have the same names and datatypes as the primary key of the entity set referenced by the dependency. Consequently, an entity set can only have one direct dependency on another entity set. In other words, the graph of dependencies between entity sets only has single edges between pairs of nodes.

2.3.4 Updating referenced elements

DataJoint’s stricter form of referential integrity prescribes a cautious procedure for updating attribute values in referenced elements. Cautious updates must ensure consistency of dependent elements that are created with consideration of the values of their respective referenced elements.

For example, suppose a student enrolls in a course by adding an element to Enroll. This element depends on Section and Student (See Listings 1 and 2 and Figure 2). The system must prevent the creation of an element in Enroll before creating the corresponding elements in Section and Student. Neither should a student or a section be deleted without also deleting corresponding enrollments. Furthermore, special caution must be taken when updating values of referenced elements. If a course section is created with associated enrollments, then updating the number of credits for the course may be restricted.

2.3.5 Singleton dependencies

A dependency on a singleton entity set (Section 2.2.4) is called a singleton dependency. Singleton dependencies obey the same rules as other dependencies. Since the referenced entity set has no new primary key attributes, a singleton dependency may create a dependency with an empty foreign key. Explicit addition of the phantom attribute (Section 2.2.3) on both ends of the dependency reduces singleton dependencies to the more conventional use of foreign keys with as the only attribute.

2.3.6 Derived dependencies

In a dependency declaration, the referenced entity set Ref is most commonly another base entity set. We will refer to such dependencies as base dependencies. All dependencies in the university database in Listings 1 and 2 are base dependencies.

More generally, Ref can be a query expression producing a derived entity set, resulting in a derived dependency. Query expressions are described in Section 5 and we postpone a more detailed discussion of derived dependencies until Section 6.1.

Derived dependencies expand the variety of representable relationships in DataJoint. In particular, derived entity sets allow renaming primary key attributes and make it possible to create multiple dependencies between a pair of entity sets, indirectly.

2.3.7 Primary and secondary dependencies

Dependencies are categorized as primary or secondary depending on whether they are declared from the primary or the secondary sections of the base entity set declaration, i.e. above or below the divider ---.

For example, StudentMajor in Listing 1 declares a primary dependency on Student and a secondary dependency on Department.

A primary dependency indicates a closer relationship between the dependent and referenced entity sets than a secondary dependency. Such a relationship may be described as defining or identifying. The primary dependencies of an entity set become dimensions of the space occupied by its elements. Primary dependencies form long-range links across the schema since the primary attributes propagate across multiple steps of dependencies, allowing direct associations between distant entity sets.

Secondary dependencies create a merely referential relationship. Secondary dependencies become secondary attributes of the dependent entity set with no direct influence on its dependents.

2.3.8 Distinguishing attributes

The primary dependencies of an entity set define the distinct dimensions of the space occupied by its elements. Any primary attributes besides those incorporated through the primary dependencies are called distinguishing. Each distinguishing attribute defines a new dimension of the space available for the elements of the entity set. Distinguishing attributes allow distinguishing multiple elements for each combination of the referenced elements.

Entity sets that have distinguishing attributes are called distinguished. Distinguished entity set are particularly important because they introduce new semantic dimensions to the schema. Undistinguished entity sets only relate distinguished entity sets or provide additional information about their elements.

2.3.9 Homologous attributes

In a well-designed schema, the same distinguishing attribute will not be defined multiple times in different entity sets. It will be defined only once and then propagated through dependencies to other entity sets that require it. When propagating across a primary dependency, a primary attribute remains primary in the dependent entity set and will thereby continue to propagate to its dependents, often spanning chains of dependencies and creating long-range semantic associations across the schema.

Homologous attributes are attributes that trace their origin to the same distinguishing attribute through chains of dependencies.

Namesake attributes are attributes in different entity sets that share the same name.

Using derived dependencies, it is possible to rename attributes in foreign keys so that homologous attributes. Therefore, homologous attributes are not always namesake. It is then possible to have two or more homologous attributes in the same entity set.

Homologous attributes underpin the logic of query expressions.

2.3.10 Relationships

In DataJoint, the term relationship is used rather generally to describe the effects of particular configurations of dependencies between multiple entity sets.

For example, the entity set Enroll in our University Database declares primary dependencies on two entity sets Student and Section. Enroll may be thought to denote a many-to-many relationship between Student and Section.

This definition of relationships differs from that in the Entity-Relationship Model where relationships are more narrowly defined as a special type of relations (See Section 1.3 and Table 2). For example, in ERM, a student’s major might be reasonably modeled as a binary relationship set majors in between Student and Department (Fig. 1A). In a DataJoint schema, the same relationship is captured by defining the entity set StudentMajor (Fig. 1B). The distinction between ERM and DataJoint is mostly semantic but it leads to further notational and conceptual simplifications.

2.3.11 Acyclicity

In DataJoint, dependencies between entity sets are not allowed to form cycles; any dependency diagram is an acyclic directed graph. This constraint may seem arbitrary since the model could be straightforwardly extended to allow cyclic dependencies. However, we find that a ban on cyclic dependencies greatly simplifies the use of the model with no substantial reduction in its representational power. With acyclicity, entity sets may be described as “upstream” and “downstream” of one another according to their topological ordering. As a consequence, DataJoint databases have a well-defined direction of workflow and are often described as “data pipelines”.

Cyclic relationships can still be captured by an acyclic set of dependencies. This commonly requires introducing a downstream entity set to represent the relationship. For example, the supervisor-subordinate relationship is a classic cyclic relationship because it relates the entity set Employee to itself: any employee may be a subordinate to another employee. In other relational data models, this could be implemented with a foreign key from Employee back into Employee. In DataJoint, this cyclic relationship must be modeled by adding a separate downstream entity set Subordinate with two dependencies on Employee as illustrated in Figure 3: a primary dependency to identify the subordinate employee and a secondary dependency to reference his or her manager.

3 Schema diagrams

DataJoint’s schema diagramming notation takes advantage the conceptual refinements of the data model. It is designed to succinctly and clearly communicate all the information necessary to understand the relationships between entities and to construct valid queries.

Figure 2 illustrates the schema diagramming notation using the university database defined in Listings 1 and 2.

Base entity sets are shown as named rectangular nodes connected with edges representing dependencies. Primary dependencies are depicted as solid lines and secondary dependencies as dashed lines. Thick solid lines denote dependencies with the same primary key on both ends, denoting a 1-to-1 relationship: these occur when the dependent entity sets has no distinguishing attributes and no additional primary dependencies.

Schema diagrams are simple graphs, as opposed to multigraphs, meaning that only one dependency can form between any pair of entity sets.

Because dependencies are acyclic (Section 2.3.11), schema diagrams are acyclic directed graphs with dependencies pointing upward: dependent entity sets are below or downstream and referenced entity sets are above or upstream.

4 Data manipulation

Data manipulation refers to operations that change the state of the stored data. In DataJoint, the three operations for data manipulation are insert, delete, and cautious update.

4.1 Insert

The insert command inserts new elements into an existing base entity set. The elements (entities) to be inserted must be fully formed and the entire set is added atomically. Invalid elements are rejected. An element may be invalid if it is incomplete, or if it violates a domain constraint (incorrect value datatype), or if it violates a unique constraint (duplicate value in the primary key), or a referential constraint (no matching entry in a referenced entity set). When an insert is rejected for any one element, the entire insert set is rejected.

Listing 3 illustrates the insert command for adding two elements to entity set Student.

insert Student (student_id, first_name, last_name, sex, date_of_birth, home_address, home_city, home_state, home_zipcode, home_phone):
(1000, ’Rebecca’, ’Sanchez’, ’F’, 1997-09-13, ’6604 Gentry Turnpike Suite 513’, ’Andreaport’, ’MN’, ’29376’, ’(250)428-1836’),
(1001, ’Matthew’, ’Gonzales’, ’M’, 1997-05-17, ’1432 Jessica Freeway Apt. 545’, ’Frazierberg’, ’NE’, ’60485-3810’, ’(699)755-6306x996’)
Listing 3: Inserting two elements into enity set Student

4.2 Delete

The delete command removes a subset of elements from a base entity set and the corresponding dependent subsets from any dependent entity sets, cascading recursively down the chains of dependencies. Deletes cascade recursively to dependent entities according to the foreign keys constraints so that deleting an entity from one entity set triggers the deletion of all matching entities downstream in the data pipeline. Each invocation of the delete command is executed atomically, no matter how many elements are deleted and how far downstream they cascade along dependency chains. To specify the subset of elements to remove, delete relies on restriction operator (Sec. 5.4).

The following snippet illustrates the delete command for deleting a subset of elements from Student and dependent elements in dependent entity sets:

delete Student & student_id > 500 & student_id <= 100

4.3 Cautious update

The update operation modifies the values of individual attributes in an entity set. As explained in Section 2.3.1, updating values of attributes in place is a precarious proposition from the point of view of referential integrity. The primary way to change the state of a DataJoint database is through delete and insert operations, which properly enforce referential constraints. Updates are only performed cautiously in exceptional cases with special consideration of their effects on dependent entity sets.

The following snippet illustrates the update command for specific attributes of a specific element of Student:

update Student & student_id == 1001:
    home_phone: "(713)555-1040x101",
    home_city: "Houston"

5 Queries

5.1 Expressions and operators

In DataJoint, data queries come in the form of query expressions that use query operators to combine base entity sets into new, derived entity sets for precise data queries. Query expressions may be assigned to variables and used in other expressions but they remain only symbolic representations of the result. When the state of the database changes, so does the result of the query expression.

DataJoint has five query operators summarized in Table 3. DataJoint’s query expressions are algebraically closed, meaning that the output of any operator can always be used as input for another.

|X[1,0.3cm] X[1,1.6cm] X[1,1.6cm]| operator & notation & result
restrict (Sec. 5.4) & A \& cond A \\ cond & The subset of all elements of A that meet (or do not meet) condition cond.
join (Sec. 5.5) & A * B & The combination of matching elements of A and B.
project (Sec. 5.6) & A.proj(...) & Select, rename, and calculate attributes for A.
aggregate (Sec. 5.7) & A.aggr(B, ...) & Calculate new attributes for A using aggregation operations on attributes of B.
union (Sec. 5.8) & A + B & The set of elements that are in either A or B or both.

Table 3: DataJoint query operators.

5.2 Operational entity normalization

DataJoint extends the principle of entity normalization (Section 1.5) to query expressions. Every derived entity set must meet all the criteria of a valid entity set. Given well-defined entity set as inputs, any DataJoint operator must produce a well-defined entity set as output with a readily identifiable entity type and primary key.

We refer to this property of DataJoint expressions as operational entity normalization; it sets DataJoint apart from other relational query languages, which limit the formal application of concepts of normalization and entity integrity to base entity sets.

5.3 Uniform logic of binary operators

Binary query operators are those that use two entity sets as inputs. They are restrict (when restricting by an entity set), join, aggregate, and union. Binary operators work by matching values of attributes across their two operands. The choice of the attributes to match defines the semantics of the operation.

Other relational query languages provide multiple ways to specify the choice of matching attributes. For example, the choice may be explicit, requiring a careful examination of the attribute semantics and resulting in unwieldy queries. Another choice are natural joins, wherein namesake attributes are matched on the assumption that they are semantically related. Natural joins lead to frequent mistakes when unrelated attributes happen to share the same names.

DataJoint offers a single uniform method for defining the semantics of binary operators through the notion of homologous namesake attributes as defined in Section 2.3.9. This semantic refinement avoids many of the pitfalls of natural joins and applies more generally to all binary operators. All binary query operators work by restricting the result to those element pairs whose homologous namesake attributes assume equal values.

For example, the entity sets StudentMajor and Enroll (Listings 1 and 2 Figure 2) have homologous namesake attributes student_id and dept, which they acquire through a series of dependencies from the primary keys of Student and Department, respectively.

Then, in the course of the binary operators

join

StudentMajor * Enroll

restrict

StudentMajor & Enroll and
StudentMajor \ Enroll

aggregate

StudentMajor.aggr(
Enroll, n: count())

elements of StudentMajor are paired with elements of Enrolled where values of student_id and dept are equal.

Universal sets allow additional control of attribute matching in binary operators and are described in detail in Section 5.9.

5.4 Restrict

The restriction expression A & cond produces an entity set comprising the subset of entity set A whose elements meet the condition cond. The output is an entity set with the same primary key, the same entity type, and the same attributes as A. The exclusion operator \ yields the complementary subset. Restrict and exclude have similar semantics; we use the term restriction to refer to both operators, considering exclude as only a variant of restrict.

In relational algebra, restriction is also known as selection. We use the term restriction to avoid confusion with SQL’s SELECT statement, which does much more than restrict and where only the WHERE clause performs restriction.

As described in section 4.2, the delete command uses restrictions to specify the subset to be deleted.

Variants of restriction depend on the form of cond as described below.

5.4.1 Restriction by attribute conditions

One form of restriction is when cond is an explicit condition on attribute values. Such conditions may include arithmetic operations, functions, range tests, etc. Some examples are listed in Listing 4.

# Students from Texas
Student & home_state == "TX"
# Students not from Texas
Student & home_state <> "TX"
Student \ home_state == "TX"
# Male students not from Texas
Student & sex == "M" \ home_state == "TX"
Listing 4: Restrictions by attribute conditions.

As with all query expressions, the result of restriction can be assigned to a variable and used in subsequent queries (e.g. Listing 5).

Millennial = Student &
   date_of_birth >= "1980-01-01" &
   date_of_birth < "2001-01-01"
MillennialMale = Millennial & sex == "M"
Listing 5: Assignment and use of relational variables.

5.4.2 Restriction by a key-value mapping

Restriction may be done by a key-value mapping in the form {key1: value1, ..., keyN: valueN}. The key-value pairs where the key is an attribute in A are treated as equality conditions. All other key-value pairs are ignored.

For example, the two query expressions in Listing 6 are equivalent given the current definition of Student.

# restriction by a key-value mapping
Student & {
   first_name: "Alice",
   last_name: "Cooper",
   dept: "MATH"}
# equivalent attribute condition
Student & first_name == "Alice" &
    last_name == "Cooper"
Listing 6: Equivalent expressions using restrictions by a mapping and by attribute conditions. The condition on dept is ignored because it is not an attribute in Student.

Accordingly, a restriction by a mapping yields the original unrestricted set when when none of the mapping’s keys match any attributes in the entity set or when the mapping has no keys: A & {}. Conversely, exclusion by such a mapping yields the empty set. This is illustrated in Listing 7.

# No restriction
Student & {}
Student & {dept: "MATH"}
# Empty set
Student \ {}
Student \ {dept: "MATH}"
Listing 7: Restriction by a mapping with no matching keys

5.4.3 Restriction by a collection

When cond is a collection (e.g. a list or a set) of conditions, then the conditions are applied by logical disjunction (logical OR): The result of A & [cond1, ..., condN] contains all elements of A that meet any of the conditions cond1, ..., condN.

# Students from Oklahoma, New Mexico, or Texas
Student & [
   home_state == "OK",
   home_state == "NM",
   home_state == "TX"]
# An equivalent restriction using the IN operator
Student & home_state in ["OK", "NM", "TX"]
Listing 8: Restrictions by a collection of conditions.

Accordingly, A & [] is empty whereas A \ [] yields the original set A.

5.4.4 Restriction by an And-collection

The special function And represents logical conjunction (logical AND) so that A & And([cond1, ..., condN]) is equivalent to A & cond1 & ... & condN.

Accordingly, the result of A \ And([]) is empty and A & And([]) yields the original set A.

5.4.5 Restriction by a negation

The special function Not represents logical negation so that A & Not(cond) is equivalent to A \ cond.

De Morgan’s laws can be applied to derive logically equivalent expressions. For example, A \ [a, b] is equivalent to A \ a \ b whereas A \ And([a, b]) is equivalent to A & [Not(a), Not(b)] and so forth.

5.4.6 Restriction by an entity set

When cond is another entity set, the result of A & cond comprises all elements from A for which there exists an element with equal values of the homologous namesake attributes in cond. All other attributes are ignored.

Since homologous attributes are semantically related, restriction by an entity set yields intuitive and well-defined results.

Listing 9 illustrates some queries with restrictions by entity sets using our university database.

# Students who have taken classes
Student & Enroll
# Students who have not taken classes
Student \ Enroll
# Students who have not selected a major
Student \ StudentMajor
Listing 9: Queries with restrictions by entity sets.
# Students who have taken Biology classes but no MATH courses
Student &
  (Enroll & dept == "BIOL") \
  (Enroll & dept == "MATH")
# Students who are not taking courses in the current term
Student \ (Enroll & CurrentTerm)
# Students who have taken classes and have chosen a major
Student & Enroll & StudentMajor
# Students who have taken classes in their major
Student & (Enroll & StudentMajor)
# Students who are taking classes outside their major in the current term
Student & (Enroll \ StudentMajor & CurrentTerm)
#Students who have taken classes or have chosen a major
Student & [Enroll, StudentMajor]
Listing 10: Composite restrictions.

In relational algebra, the closest equivalent to DataJoint’s restriction and exclusion operators by an entity set are called, respectively, semijoin and antijoin. We deprecate these confusing terms because these operators are restriction-like and not join-like in their semantics.

5.5 Join

The result of the join operator A * B combines A and B into a single entity set. The primary key of a join is the union of the primary key attributes of its operands. The entity type of the result is the pairing of the entity types of the operands.

DataJoint considers entity sets A and B joinable when all their namesake attributes are also homologous. An attempt to join unjoinable entity sets will raise an exception. Any two entity sets can be made joinable by applying appropriate projection operators (Section 5.6) to its operands to remove or rename namesake attributes that should not be involved in the logic of the join. In addition to stricter semantic clarity, join compatibility ensures unambiguous unique attribute names in the result.

For example, Listing 11 illustrates queries that combine data from multiple base entity sets into a single derived entity set.

# Grade point values
Grade * LetterGrade
# Graded enrollments with complete course and student information
Student * Enroll * Course * Section * Grade * LetterGrade
Listing 11: Combining entities.

Listing 12 illustrates complex queries that include joins.

# Students with ungraded courses in current term
Student & (Enroll * CurrentTerm \ Grade)
# Enrollments before students’ date of birth
Student * Enroll & (term_year <= date_of_birth)
Listing 12: Join in expressions.

5.6 Projection

The projection operator A.proj(attr1, ..., attrN is a unary operator on the input entity set A. Its output is an entity set with the same entity type and the same number of elements as A with some attributes renamed, some secondary attributes omitted, or new calculated attributes introduced. A single projection operator can perform any combination of all three of these functions in a single invocation.

5.6.1 Attribute selection

Projection omits all secondary attributes except those listed as arguments; it retains the primary attributes to guarantee that the output is a well-defined entity set.

Listing 13 shows examples of attribute selection.

# Student id, firs name, and last name
Student.proj(first_name, last_name)
# Student id only
Student.proj()
Listing 13: Selecting attributes.

5.6.2 Renamed attributes

When the attribute specification in a projection call is the form new: old, this renames an existing attribute in the relation from old to new.

For example, the result of the query in Listing 14 renames attribute dept into major.

# rename dept into major
StudentMajor.proj(major: dept)
Listing 14: Renaming attributes.

Renaming is commonly used to control the semantics of binary operators such as joins, restrictions, and aggregations.

For example queries in Listing 15 differ in meaning thanks to the renaming of the matching attribute dept. In queries 1 and 3, dept is among the matching attributes. In Enroll and Grade, dept refers to the department offering the course; in StudentMajor, it denotes the students’ major departments.

In queries 2, 4, and 5, dept is renamed into major; it does not restrict the enrollment records and major becomes a distrinct attribute in the output of the joins. In query 5, the two attributes major and dept are used in a further restriction of the join result.

# 1. Enrollments by students in the same
# major as the department offering course
Enroll & StudentMajor
# 2. Enrollments by students who have
# chosen a major
Enroll & StudentMajor.proj(major: dept)
# 3. Grades in courses within student’s major
Grade * StudentMajor
# 4. Grade and major information
Grade * StudentMajor.proj(major: dept)
# 5. Grade outside chosen major
Grade * StudentMajor.proj(major: dept) & major != dept
Listing 15: Renaming attributes.

Renaming is often necessary when joining one entity with itself to obtain pairings of elements from the same entity set. For example, the query in Listing 16 pairs students but it must rename the primary key attribute in one of the operands to allow pairing and it must rename or project out secondary attributes.

# pairs of students with the same birthdays
Student * Student.proj(
   student_id2: student_id,
   date_of_birth2: date_of_birth) &
      student_id < student_id2 &
      date_of_birth = date_of_birth2
Listing 16: Joining an entity set with itself

Renaming may also be used to make joinable entity sets that are non-joinable due to the presence of non-homologous namesake attributes. By design, our university database lacks unjoinable entity sets but imagine if both Department and Enroll had secondary attributes room that were non-homologous, i.e. not referencing the same original definition. That would make these two entity sets non-joinable because the attribute room would be ambiguous in the output. To make them joinable again, the offending attribute would need to be renamed before the join as in

Enroll.proj(classroom: room, ...) * Department

Here the ellipsis ... instructs to include all the remaining attributes.

5.6.3 Calculated attributes

Projection can also be used to calculate new attributes for each element of the entity set based on the existing attributes of the same element. The new attribute name and the calculations are specified as attr: expression. Calculated attributes are always secondary.

For example, the first query in Listing 17, the attribute total calculates the total points earned by students in each course as the product of attributes credits and points that come from Course and LetterGrade, respectively, for each entry in Grade. As an example of a more precise query, the second query adds (course_name) to the output and restricts to the current term.

# Total grade points
(Grade * Course * LetterGrade).proj(
    total: points * credits)
# Total grade points in current term
(Course * Grade * LetterGrade).proj(
    course_name,
    total: points * credits) & CurrentTerm
Listing 17: Extension: calculated attributes.

5.7 Aggregation

Aggregation is a generalization of the projection operator. Operator A.aggr(B, ...) can do anything that operator A.proj(...) can. In fact, without the argument B, the two operators are exactly equivalent. Aggregation allows adding calculated attributes to each element of A based on aggregation functions over attributes in the matching subsets of elements of B.

Aggregation functions include count, sum, min, max, avg, median, percentile, stddev, var, and others. Aggregation functions cannot be used anywhere else other than in definitions of new attributes in the aggr operator.

Just like in projection, the output of aggregation A.aggr(B, ...) has the same entity class, the same primary key, and the same number of elements as A. Just like in projection, the primary key attributes are always included in the output but may be renamed.

From the point of view of SQL and relational algebra, aggregation is equivalent to the left outer join of A and B on their homologous namesake attributes followed by the GROUP BY operation on the primary key attributes of A.

Generally, A and B may have namesake attributes that are not homologous but then they may not be included in the output due to the ambiguity. Making A and B joinable avoids any ambiguity in referencing attributes.

For example, Listing 18 illustrates aggregation queries for the university database.

# Number of students in each course section
Section.aggr(Enroll, n: count())
# Average grade in each course
Course.aggr(Grade * LetterGrade,
    avg_grade: avg(points))
Listing 18: Calculate summary statistics.

In Listing 19, queries make use of variables storing derived entity sets and then used in new queries.

# 1. Number of students enrolled per section
enrolled = Section.aggr(Enroll, n: count())
# 2. Number of students graded per section
graded = Section.aggr(Grade, m: count())
# 3. Fraction of enrolled students with grades
fraction = (enrolled * graded).proj(
    frac: m/n)
# 4. Student GPA
grades = Course * Grade * LetterGrade
GPA = Student.aggr(grades,
    gpa: sum(points * credits)/
         sum(credits))
# 5. Average GPA across university
# by student major
Department.aggr(StudentMajor * GPA,
    avg_gpa: avg(gpa))
# 6. GPA in current term
Student.aggr(grades & CurrentTerm,
    gpa: sum(points * credits) / sum(credits))
Listing 19: Calculate summary statistics using variables to store subqueries.

5.8 Union

The output of the union operator A + B contains all elements of both A and B. From the perspective of entity normalization, union requires that its operands A and B belong to the same entity type with the same primary key with homologous attributes. In the simplest case, without any secondary attributes, the result is the simple set union.

In the presence of secondary attributes, the secondary attributes must have the same names and datatypes in the two operands. Additionally, the two operands must be disjoint, without any duplicate values of the primary key across both inputs. These definitions avoid any ambiguity of attribute values or entity identity.

5.9 Universal sets

DataJoint allows using universal sets to define virtual entity sets with arbitrary primary key structure for use in query expressions. A universal set denotes the set of all possible elements with given attributes of any possible datatypes. Let U(attr1, ..., attrN) denote the entity set with attributes attr1, , attrnN. In the DataJoint query language, U is a reserved word for defining universal sets. All attributes of a universal set are considered primary. When the attribute list is empty, U() has exactly one element with zero attributes; or it may be thought to have the single phantom attribute as described in Section 2.2.3.

Universal sets become useful for defining entity sets as necessary for query expressions when no suitable base entity sets exist. In expressions, their attributes are allowed to be matched to any namesake attributes without the requirement of being homologous.

Imagine we needed to query the university database for the complete list of students’ home cities and get the number of students from each city. The university database lacks the entity set for cities and states. The virtual entity set U(home_city, home_state) can play the role of the base entity set for cities as illustrated in Listing 20. Although the virtual entity set is nameless, its primary key implies the class type it represents.

# All home cities of students
U(home_city, home_state) & Student
# Total number of students from each city
U(home_city, home_state).aggr(
            Student, n: count())
# Total number of students from each state
U(home_state).aggr(Student, n: count())
# Total number of students in the database
U().aggr(Student, n: count())
Listing 20: Using universal set to create virtual entity sets for City and State.

Note that the result of aggregation on a universal set is restricted to the elements with matches in the aggregated set. In other words, X.aggr(A, ...) is interpreted as (X & A).aggr(A, ...) when X is a universal set.

Universal sets should be used sparingly when no suitable base entity sets already exist. Resorting to a universal set effectively states, “I am defining a new entity set for this query because no other appropriate entity set exists.” In some cases, it may be preferable to define new base entity sets to make queries clearer and more semantically constrained.

For example, we could define base entities City and State and then use them in Student as follows.

# singleton entity set for the United States
::USA
::State
-> USA
state char(2) :  two-letter acronym
---
state_name : varchar(20) #  full state name
::City
city : varchar(30)
-> State
::Student
student_id : int unsigned   # university ID
---
first_name      : varchar(40)
last_name       : varchar(40)
sex             : enum(’F’, ’M’, ’U’)
date_of_birth   : date
home_address    : varchar(200) # street address
-> City
Listing 21: Adding entity sets for State and City to avoid using universal sets in queries.

Then the queries in Listing 20 could be replaced with equivalent queries in Listing 22. The explicit definition of City and State make the queries clearer but requires creating and populating extra base entity sets, which are avoided by using universal sets.

# All home cities of students
City & Student
# Total number of students from each city
City.aggr(Student, n: count())
# Total number of students from each state
State.aggr(Student, n: count())
# Total number of students in the database
USA.aggr(Student, n: count())
Listing 22: Queries equivalent to Listing 20 made possible by adding more base entity sets in Listing 21.

Listing 23 provides a few more query examples where, in the absence of a suitable base entity, a universal entity is used.

# Total number of unique last names
U().aggr(U(last_name) & Student, n: count())
# Number of students from each state by sex
U(home_state, sex).aggr(Student,
    n: count())
# Student GPA, retaining the sex attribute
GPA = Student.aggr(sex,
    Course * Grade * LetterGrade,
    gpa: sum(points * credits) /
         sum(credits))
# Average GPA by sex
GpaBySex = U(sex).aggr(GPA,
    avg_gpa: avg(gpa))
Listing 23: Aggregation by universal entities.

5.10 No outer joins

DataJoint does not support outer joins as they contradict the principle of operational entity normalization.

In relational algebra and SQL, an outer join of relations A and B adds to its output all the elements of A (left outer join) or B (right outer join), or both (full outer join) in addition to the entire result of the inner joins. Outer joins contradict the concept of entity normalization (Sections 1.5 and 5.2) because it mixes entity types with different primary keys into a single relation. The primary key of an outer join is poorly defined and may have nulls in it, breaking entity integrity. Outer joins are commonly used to prepare input into a GROUP BY operation. DataJoint’s aggr operator accomplishes both steps (left join and GROUP BY) as a single operator, always observing entity normalization in its operations and reducing the need for outer joins.

6 Advanced schema definition

6.1 Uses of derived dependencies

Section 2.3.6 introduced the concept of derived dependencies. In a derived dependency -> Ref, the referenced entity set is a query expression rather than a base entity set. Derived dependencies enrich the representational power of DataJoint schemas.

Importantly, the query expression in the derived dependency may not rely on downstream entity set to prevent cyclic dependencies.

Each dependency -> Ref references only the primary key of Ref, making the dependency equivalent to the empty projection -> Ref.proj().

6.1.1 Dependency on a restriction

Dependency on a restriction -> Ref & cond has the effect of preventing inserts of elements unless they reference an element that meets the condition cond. The restriction has no effect on deletes or on query expressions involving entity sets from both side of the dependency. Dependencies on restricted entity sets differ little from dependencies on the unrestricted entity set. In the schema diagram, the restricted entity entity set is not shown separately and does not have any special notation.

6.1.2 Dependency on a join

Conceptually, a dependency on the join -> A * B is exactly equivalent to the two restricted dependencies

-> A & B
-> B & A

In the case when A and B share no secondary homologous namesake attributes, the dependency is exactly equivalent to the unrestricted separate dependencies

-> A
-> B

Due to these equivalences, schema diagrams do not require special notation for dependencies on joins, showing them as separate dependencies.

6.1.3 Dependency on a projection

Dependencies only use the primary key of the referenced entity set. Hence, the only relevant application of projection is for the renaming of primary key attributes — which is also one of the most useful applications of derived dependencies.

For example, imagine we needed to represent the supervisor-subordinate relationship discussed in Section 2.3.10. Suppose the entity set Employee is already defined with emp_id as the primary key. Then the entity set Subordinate could define two dependencies on Employee as defined in Listing 24 and Figure 3:

::Subordinate
-> Employee
---
-> Employee.proj(manager: emp_id)
Listing 24: Entity set Subordinate expressing the manager-subordinate relationship using a derived dependency.
Figure 3: The schema diagram for the manager-subordinate relationship as defined in Listing 24.

The resulting entity set will have two attributes: the primary key emp_id for the employee herself and the secondary attribute manager. Both attributes are foreign key attributes referring Employee.

As another example, suppose we need to represent prerequisite courses for each course in our university database. Listing 25 defines entity set Prerequisite to represent the prerequisite relationship between courses.

::Prerequisite
-> Course
-> Course.proj(
    pre_dept: dept, pre_course: course)
Listing 25: Entity set Prerequisite.

In the schema diagram, projection is shown as a separate node (Figure 4).

Figure 4: The schema diagram of the university schema including entity set Prerequisite defined in Listing 25. The derived entity set for the derived dependency is shown as a separate node in orange.

6.1.4 Dependency on a union

Dependency on a union finds many uses when the dependent entity set depends on entities in either of two entity sets. For example, imagine that the university database also contains university employees and library members must be either students or employees. This might be implemented with the base entity sets defined in Listing 26, which provides two semantically equivalent definitions of LibraryCard: one with a union dependency and the other with a restricted dependency.

::Person
person_id : int unsigned   # university ID
---
first_name      : varchar(40)
last_name       : varchar(40)
sex             : enum(’F’, ’M’, ’U’)
date_of_birth   : date
...
::Student
-> Person
---
admission_date  : date
::Employee
-> Person
---
hire_date : date
::LibraryCard
-> Student.proj() + Employee.proj()
---
expiration  : date
# Alternative nearly equivalent definition
::LibraryCard
-> Person & [Student, Employee]
---
expiration  : date
Listing 26: Alternative university database schema definition with employees and library cards.

6.2 Queries with derived dependencies

Explicit use of query expressions to define dependencies simplifies the formulation of valid queries. The query expressions defining the dependency directly translate into queries or subqueries relating the dependent and referenced entity sets.

Listing 27 illustrates how the query expression used in the definition of the dependency in Prerequisite is reused to form a data query.

# Course BIOL-1010
c = Prerequisite & {dept: "BIOL", course: 1010})
# Prerequsite details for the course
Course.proj(
   pre_dept: dept, pre_course: course) & c
# Sections of prerequisites in current term
(Section & CurrentTerm).proj(
   pre_dept: dept, pre_course: course) & c
Listing 27: Reusing the query expressions from the dependency declaration in Listing 25 in a data query.

6.3 Dependency properties

Dependencies may specify additional properties unique and nullable. They are specified in square brackets in front of the query expression.

-> [unique, nullable] Ref

By default, the dependency is not nullable and is not unique.

6.3.1 Unique dependencies

With a unique dependency, each element of the referenced entity set cannot be referenced by more than one element in the dependent entity set. Effectively, this defines a uniqueness constraint on the foreign key attributes and enforces a 1:1 relationship between the entity sets even when the dependency is not primary.

6.3.2 Nullable dependencies

Only secondary dependencies may be declared with the nullable option. This indicates an optional dependency where the dependent foreign key attributes may be omitted, i.e. set to null. A unique dependency may also be nullable, in which case the null references are exempted from the uniqueness constraint.

6.4 Master-part relationship

Let compositional integrity denote the guarantee made by the data management process that entities composed of multiple parts always appear in their complete form.

In DataJoint, compositional integrity is formalized through the notion of a master-part relationship. In a master-part relationship, one base entity set is designated as the master. A master may have one or several dependent part entity sets. In DataJoint, the names of part entity sets are prefixed with the name of their master entity set.

For example, Listing 28 defines a database schema with the master-part relationship between Order and Order.Item.

::Customer
customer : char(16)  # customer id
---
customer_name : varchar(128)
customer_address : varchar(255)
::Product
product : char(16)  # SKU number
---
product_name : varchar(120)
product_description : varchar(4000)
::Order
order  : int
---
-> Customer
order_datetime : datetime
::Order.Item
-> master
item : int
---
-> Product
unit  :  char(10)  # unit of measurement
price : decimal(7,2)  # price per unit
quantity : decimal(7,3)
::Shipment
-> Order
---
tracking_number  : char(36)
ship_datetime : datetime
Listing 28: A schema containing a master-part relationship between Order and Order.Item.

Figure 5 depicts the corresponding schema diagram. Note that in the schema diagramming notation, part entity sets are depicted as nodes with no boxes.

Figure 5: The diagram of the schema defined in Listing 28, containing a master-part relationship between Order and Order.Item.

The master-part relationship informs the application that the master and all its parts should always appear together or not at all. When inserting new data, applications must start an atomic transaction when inserting into the master and commit the transaction only when all the corresponding parts have been inserted. Direct deletes from part entity sets are not allowed: their data may only be deleted by deleting from their master, cascading to the parts. This process guarantees that other users of the database never see partial data and that interrupted processes never leave incomplete data.

The master-part relationship has important implications for dependencies. When a downstream entity set forms a dependency on a master entity set, it may also be considered depended on all its parts as well.

In our example, every element in Order includes multiple dependent elements in Order.Item. While the items are inserted, the order is not made visible to other processes. Once all the associated items are inserted, the entire order is committed and made visible to other agents accessing the database. After that, elements of Order.Item cannot be deleted directly. An order with all its items can be considered as one indivisible entity. Any entity that is dependent on the master becomes also dependent on all its parts. For example, an element of Shipment is dependent on an element of Order with all associated elements of Order.Item. The master-part pattern is suitable for representing cohesive groupings of elements within and across entity sets.

The DataJoint model does not allow nested master-part relationships: a part in one relationship cannot be a master in another. This constraint is another tradeoff between representational power and simplicity. Nested master-part relationships would also require nesting transactions whereas popular database engines only support one level of transactions.

7 Computation

7.1 Programming language embedding

In this paper, the DataJoint model is presented as a stand-alone language with its own syntax. However, its data manipulation and query sublanguages translate straightforwardly into bindings in any programming language with rich object-oriented programming constructs. Indeed, DataJoint evolved from original MATLAB and Python implementations [Yatsenko et al., 2015].

This section describe computations only briefly, focusing on key concepts. A full implementation of computed entity sets is defined in the context of the host programming language.

7.2 Computed entity sets

DataJoint supports a well-defined approach to computations by defining computed entity sets. A computed entity set is defined using the same notation as other base entity sets as described in Section 2. However, instead of creating its elements using the insert command, a computed entity set computes its contents automatically.

To accomplish this, the computed entity set must define a make function on its primary dependency domain.

7.3 Primary dependency domain

As discussed previously, the primary dependencies and distinguishing attributes of an entity set serve as the dimensions of the entity sets.

Consider the entity set Computed defined as

::Computed
-> P
-> Q
r : int      # distinguishing attribute
---
value : double   # computed value

Then the elements of Computed may be thought to occupy the three-dimensional space with axes P.proj(), Q.proj(), and r.

The primary dependencies define the primary dependency domain. In the case of Computed, the primary dependency domain is the join D = P.proj() * Q.proj(), comprising the allowed values of the primary key attributes of Computed.

7.4 The make function

The function s = Computed.make(t) must be defined to yield the set of elements s for any element t pulled from D. If Computed does not have any distinguishing attributes, then s will contain only one element. With distinguishing attributes, s will include multiple elements with different values of the distinguishing attribute. Every element of s must define all the required attributes.

Once the make function is defined, the entity set can be automatically populated, i.e. inserted into Computed.

8 Discussion

In the half century since the emergence of the relational model for databases, its use has been standardized by the wide adoption of SQL. Even so, both SQL and the relational data model lack conceptual clarity, making them unwieldy for conceptual design and for complex queries. Computer scientists have converged on teaching a two-phased approach consisting of conceptual modeling followed by logical modeling prior to implementation.

DataJoint represents a relational data model that is more conceptually refined than earlier incarnations. The model presented in this paper constitutes a complete query language for data definition, data manipulation, and data queries. This conceptual refinement is summarized by the concept of entity normalization (Sections 1.5 and 5.2). A single data definition language and the schema diagramming notation are sufficiently clear to effectively communicate the overall design within the development team, with management and customers.

The data definition language provides well-defined constructs for data integrity including entity integrity (Section 2.2.1), referential integrity (Section 2.3.1, and compositional integrity (Section 6.4).

DataJoint’s conceptual refinements resulted in a minimal query language for succinct and expressive data queries. For reference, Section A lists equivalent SQL and DataJoint queries for examples from the main text.

The syntax of query expressions is algebraically closed and uses binary and unary operators (Section 5.1), which allows straightforward embedding in most object-oriented programming languages with only minor syntactical differences from the query languages described in this paper [Yatsenko et al., 2015].

The DataJoint model evolved through numerous refinements while in active continuous use for processing vast amounts of data from scientific experiments. This paper summarizes the ground principles that will drive further improvements and the current and emerging implementations of the model.

9 Acknowledgements

We thank Fabian Sinz, Christopher Turner, and Jacob Reimer for their critical review and thorough feedback on this manuscript.

We conceived DataJoint in Andreas S. Tolias’ lab in the Neuroscience Department at Baylor College of Medicine in the fall of 2009. Initially implemented as a thin MySQL API in MATLAB, it defined the major principles of the DataJoint model summarized here.

Many students and postdocs in the lab as well as collaborators and early adopters have contributed to the project. Jacob Reimer and Emmanouil Froudarakis became early adopters in Andreas Tolias’ Lab and propelled development. Alexander S. Ecker, Philipp Berens, Andreas Hoenselaar, and R. James Cotton contributed to the formulation of the overall requirements for the data model and critical reviews of DataJoint development.

Outside the Tolias lab, the first labs to adopt DataJoint (approx. 2010) were the labs of Athanassios G. Siapas at CalTech, Laura Busse and Steffen Katzner at the University of Tübingen.

In 2015, the Python implementation gained momentum with Edgar Y. Walker and Fabian Sinz joining as principal contributors.

In 2017, DARPA awarded a small-business innovation research grant to Vathes LLC (Contract D17PC00162) to further develop and publicize the DataJoint framework.

In June 2018, the Princeton Neuroscience Institute, under the leadership of Prof. Carlos Brody, began funding a project to generate a detailed DataJoint user manual.

Appendix A SQL Translations

This section contains translations of a selection of DataJoint queries from the main text into SQL.

CREATE TABLE Student (
   student_id int unsigned NOT NULL COMMENT "university ID",
   first_name varchar(40) NOT NULL,
   last_name varchar(40) NOT NULL,
   sex enum(’F’, ’M’, ’U’) NOT NULL,
   date_of_birth date NOT NULL,
   home_address varchar(200) NOT NULL COMMENT "street address",
   home_city varchar(30) NOT NULL,
   home_state char(2) NOT NULL COMMENT "two-letter abbreviation",
   home_zipcode varchar(10) NOT NULL,
   home_phone varchar(14) NOT NULL,
   PRIMARY KEY (student_id))
CREATE TABLE Department(
   dept char(6) NOT NULL COMMENT "abbreviated department name, e.g. BIOL",
   dept_name varchar(200) NOT NULL COMMENT "full department name",
   dept_address varchar(200) NOT NULL COMMENT "mailing address",
   dept_phone varchar(14),
   PRIMARY KEY(dept))
CREATE TABLE StudentMajor(
   student_id int unsigned NOT NULL COMMENT "university ID",
   dept char(6) NOT NULL COMMENT "abbreviated department name, e.g. BIOL",
   declare_date date NOT NULL COMMENT "when student declared her major",
   PRIMARY KEY (student_id),
   FOREIGN KEY (student_id) REFERENCES Student(student_id),
   FOREIGN KEY (dept) REFERENCES Department(dept))
Listing 29: SQL translations of queries from Listing 1.
#  Student & Enroll
SELECT *
FROM Student
WHERE student_id IN (
        SELECT student_id
        FROM Enroll)
#  Student \ Enroll
SELECT *
FROM Student
WHERE student_id NOT IN (
        SELECT student_id
        FROM Enroll)
Listing 30: SQL translations of queries from Listing 9
#  Student &
#    (Enroll & dept == "BIOL") \
#    (Enroll & dept == "MATH")
SELECT *
FROM Student
WHERE student_id IN (
        SELECT student_id
        FROM Enroll
        WHERE dept == "BIOL")
    AND student_id NOT IN (
        SELECT student_id
        FROM Enroll
        WHERE dept == "MATH")
#  Student \ (Enroll & CurrentTerm)
SELECT *
FROM Student
WHERE student_id NOT IN (
        SELECT student_id
        FROM Enroll
        WHERE (student_id, term_year, term) IN (
            SELECT student_id, term_year, term
            FROM CurrentTerm))
#  Student & (Enroll \ StudentMajor & CurrentTerm)
SELECT *
FROM Student
WHERE student_id IN (
        SELECT student_id
        FROM Enroll
        WHERE (student_id, dept) NOT IN (
                SELECT student_id, dept
                FROM StudentMajor)
            AND (term_year, term) IN (
                SELECT term_year, term
                FROM CurrentTerm))
#  Student & [Enroll, StudentMajor]
SELECT *
FROM Student
WHERE student_id IN (
        SELECT student_id
        FROM Enroll)
    OR student_id IN (
        SELECT student_id
        FROM StudentMajor
Listing 31: SQL translations of queries from Listing 10.

References

  • Codd [1970] E. F. Codd. A Relational Model of Data for Large Shared Data Banks. Commun. ACM, 13(6):377–387, June 1970. ISSN 0001-0782. doi: 10.1145/362384.362685. URL http://doi.acm.org/10.1145/362384.362685.
  • Kent [1983] William Kent. A simple guide to five normal forms in relational database theory. Communications of the ACM, 26(2):120–125, 1983.
  • Chen [1976] Peter Pin-Shan Chen. The entity-relationship model — toward a unified view of data. ACM Transactions on Database Systems (TODS), 1(1):9–36, 1976.
  • Elmasri and Navathe [2015] Ramez Elmasri and Shamkant Navathe. Fundamentals of database systems. Addison-Wesley Publishing Company, 7th edition, 2015.
  • Coronel and Morris [2016] Carlos Coronel and Steven Morris. Database systems: design, implementation, & management. Cengage Learning, 2016.
  • Yatsenko et al. [2015] Dimitri Yatsenko, Jacob Reimer, Alexander S. Ecker, Edgar Y. Walker, Fabian Sinz, Philipp Berens, Andreas Hoenselaar, Ronald James Cotton, Athanassios S. Siapas, and Andreas S. Tolias. DataJoint: managing big scientific data using MATLAB or Python. bioRxiv, page 031658, November 2015. doi: 10.1101/031658. URL http://biorxiv.org/content/early/2015/11/14/031658.