Deep Models for Relational Databases

03/21/2019 ∙ by Devon Graham, et al. ∙ The University of British Columbia

Due to its extensive use in databases, the relational model is ubiquitous in representing big-data. We propose to apply deep learning to this type of relational data by introducing an Equivariant Relational Layer (ERL), a neural network layer derived from the entity-relationship model of the database. Our layer relies on identification of exchangeabilities in the relational data(base), and their expression as a permutation group. We prove that an ERL is an optimal parameter-sharing scheme under the given exchangeability constraints, and subsumes recently introduced deep models for sets, exchangeable tensors, and graphs. The proposed model has a linear complexity in the size of the relational data, and it can be used for both inductive and transductive reasoning in databases, including the prediction of missing records, and database embedding. This opens the door to the application of deep learning to one of the most abundant forms of data.




1 Introduction

In the relational model of data, we have a set of entities, and one or more instances of each entity. These instances interact with each other through a set of fixed relations between entities. A set of attributes may be associated with each type of entity and relation. (Footnote 1: An alternative terminology refers to instances and entities as entities and entity types.) This simple idea is widely used to represent data, often in the form of a relational database, across a variety of domains, from shopping records, social networking data and health records, to heterogeneous data from astronomical surveys.

Figure 1: (a) The Entity-Relationship (ER) diagram for our running example, with three entities: student, course and professor (labeled 1, 2 and 3 respectively), three pairwise relations: takes (student-course, represented by ), writes reference (student-professor; ), and teaches (professor-course; ), and one self-self relation: prerequisite (course-course; ). The full set of relations is . Both entities and relations have associated attributes; e.g., when a student takes a course, they receive a grade. The singleton relation can encode student attribute(s) such as number of internships, or salary after graduating. Since each course has a single professor as its teacher, this relation is one-to-many. (b) Some possible relational tables in an instantiation of the ER diagram of (a). There are instances of student, instances of course and instances of professor. The attributes associated with each entity and relation are stored in the corresponding table; e.g., table suggests that the student 5 took course 4 and received a grade of 93 and tells us that student 5 completed 3 internships and earns a salary of $100. (c) Sparse tensor representation of the tables of (b). The vectorized form of this set of sparse tensors, , i.e., the column-vector concatenation of ’s, is the input to our neural layer . Here, the parameter-tying in the weight matrix of ERL (Fig. 2) guarantees that any permutation of elements of due to shuffling of entities in this tensor representation results in the same permutation of the output of the layer.

Learning and inference on relational data has been a topic of research in machine learning for decades. The relational model is closely related to first-order and predicate logic, where the existence of a relation between instances becomes a truth statement about a world. The same formalism is used in AI through a probabilistic approach to logic, where the field of statistical relational learning has fused the relational model with the framework of probabilistic graphical models. Examples of such models include plate models, probabilistic relational models, Markov logic networks and relational dependency networks (Getoor & Taskar, 2007).

A closely related area that has enjoyed accelerated growth in recent years is relational and geometric deep learning, where the term “relational” is used to denote the inductive bias introduced by a graph structure (Battaglia et al., 2018). Although relational and graph-based terms are often used interchangeably within the machine learning community, they can refer to different data structures: graph-based models, such as graph databases (Robinson et al., 2013) and knowledge graphs (Nickel et al., 2016), simply represent the data as an attributed (hyper-)graph, while the relational model, extensively used in databases, uses the entity-relation (ER) diagram (Chen, 1976) to constrain the relations of each instance (corresponding to a node), based on its entity-type; see Fig. 1(a) and the related work in Section 6.

Here, we use an alternative inductive bias, namely invariance and equivariance, to encode the structure of relational data. This type of bias informs a model’s behaviour under various transformations, and it is built on group theory rather than graph theory. Equivariant models have been successfully used for deep learning with a variety of topologically distinct structures, from images and graphs to sets and spheres. By adopting this perspective, we present a maximally expressive neural layer that achieves equivariance w.r.t. exchangeabilities in relational data. Our model generalizes recently proposed models for sets (Zaheer et al., 2017), exchangeable tensors (Hartford et al., 2018), as well as equivariant models for attributed hyper-graphs (Maron et al., 2018).

2 Representations of Relational Data

To formalize the relational model, we have a set of entities, and a set of instances of each entity . These entities interact with each other through a set of relations . For each relation , where , we observe data in the form of a set of tuples . Note that the singleton relation can be used to incorporate individual entity attributes (such as professors’ evaluations in Fig. 1-a). For a full list of notation used in this paper, see Table 1.

In the most general case, we allow for both , and any to be multisets (i.e., to contain duplicate entries). is a multiset if we have multiple relations between the same set of entities. For example, we may have a supervises relation between students and professors, in addition to the writes reference relation. A particular relation is a multiset if it contains multiple copies of the same entity. Such relations are ubiquitous in the real world, describing for example, the connection graph of a social network, the sale/purchase relationships between a group of companies, or, in our running example, the course-course relation capturing prerequisite information.

For our initial derivations we make the simplifying assumption that each attribute is a scalar. Later, we extend this to . In general, this attribute can be a complex object, such as text, image or video, encoded (decoded) to (from) a feature-vector using a deep model, in an end-to-end training. Another common feature of relational data is the one-to-many relation, addressed in Appendix B.

2.1 Tuples, Tables and Tensors

In relational databases the set of tuples is often represented using a table, with one row for each tuple; see Fig. 1(b). An equivalent representation for is using a “sparse” -dimensional tensor , where each dimension of this tensor corresponds to an entity , and the length of that dimension is the number of instances . In other words

We work with this tensor representation of relational data. We use to denote the set of all sparse tensors that define our relational data(base); see Fig. 1(c). For the following discussions around exchangeability and equivariance, w.l.o.g. we assume that for all , are fully observed, dense tensors. Subsequently, we will discard this assumption and attempt to make predictions for (any subset of) the missing records.
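As a rough illustration of the table-to-tensor correspondence described above, the following sketch stores a small takes table as a coordinate map and expands it into a dense array. The row values (apart from the student 5, course 4, grade 93 entry mentioned in the caption of Fig. 1), the instance counts, and the helper names `table_to_sparse` and `sparse_to_dense` are all assumptions made for this example and are not taken from the paper.

```python
import numpy as np

# Hypothetical "takes" table: one row per tuple (student index, course index, grade).
takes_rows = [(5, 4, 93.0), (0, 2, 71.0), (3, 4, 88.0)]
num_students, num_courses = 6, 5  # assumed instance counts

def table_to_sparse(rows):
    """Store a table as a coordinate map {index tuple: attribute value}."""
    return {tuple(row[:-1]): row[-1] for row in rows}

def sparse_to_dense(sparse, shape, fill=np.nan):
    """Expand the coordinate map into a dense tensor; unobserved cells hold `fill`."""
    dense = np.full(shape, fill)
    for index, value in sparse.items():
        dense[index] = value
    return dense

X_sparse = table_to_sparse(takes_rows)
X_dense = sparse_to_dense(X_sparse, (num_students, num_courses))
```

Each tensor dimension corresponds to one entity of the relation, and the number of stored entries equals the number of rows in the table, which is what makes the sparse form attractive when most records are missing.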

Multiset Relations.

Because we allow a relation to contain the same entities multiple times, we formally define a multiset as a tuple , where is a set, and maps elements of to their multiset counts. We will call the elements of the multiset , and the count of element . We define the union and intersection of two multisets and as

In general, we may also refer to a multiset using typical set notation (e.g.). We will use bracketed superscripts to distinguish distinct but equal members of any multiset (e.g.). The ordering of equal members is specified by context or arbitrarily. The size of a multiset accounts for multiplicities: .

3 Exchangeabilities of Relational Data(bases)

Recall that in the representation , each entity has a set of instances indexed by . However, this ordering is arbitrary and we can shuffle these instances, affecting only the representation, and not the “content” of our relational data. However, in order to maintain consistency across data tables, we also have to shuffle all the tensors , where , using the same permutation applied to the tensor dimension corresponding to . At a high level, this simple indifference to shuffling defines the exchangeabilities of relational data. A mathematical group formalizes this idea.

A mathematical group is a set equipped with a binary operation between its members, such that the set and the operation satisfy closure, associativity, invertibility and existence of a unique identity element. refers to the symmetric group, the group of “all” permutations of objects. A natural representation for a member of this group , is a permutation matrix . Here, the binary group operation is the same as the product of permutation matrices. In this notation, is the group of all permutations of instances of entity . To apply permutations to multiple dimensions of a data tensor we can use the direct product of groups. Given two groups and , the direct product is defined by


That is, the underlying set is the Cartesian product of the underlying sets of and , and the group operation is the component-wise operation.

Observe that we can associate the group with a relational model with entities, where each entity has instances. Intuitively, applying permutations from this group to the corresponding relational data should not affect the underlying contents, while applying permutations from outside this group should. To see this, consider the tensor representation of Fig. 1(c): permuting students, courses or professors shuffles rows or columns of , but preserves its underlying content. However, arbitrary shuffling of the elements of this tensor could alter its content.

Our goal is to define a neural network layer that is “aware” of this structure. For this, we first need to formalize the action of on the vectorized form of .


For each tensor , refers to the total number of elements of tensor (note that for now we are assuming that the tensors are dense). We will refer to as the number of elements of . Then refers to the vectorization of , obtained by successively stacking its elements along its dimensions, where the order of dimensions is given by . We use to refer to the inverse operation of , so that . With a slight abuse of notation, we use to refer to , the vector created by stacking all of the ’s in a column according to a fixed ordering of the relations. The layer that we design later is also applied to this vectorized form of the relational data.

Group Action.

The action of on , permutes the elements of . Our objective is to define this group action by mapping to a group of permutations of objects – i.e., a subgroup of . To this end we need to use two types of matrix product.

Let and be two matrices. The direct sum is an block-diagonal matrix


and the Kronecker product is an matrix


Note that in the special case that both and are permutation matrices, and will also be permutation matrices.

Both of these matrix operations can represent the direct product of permutation groups. That is, given two permutation matrices , and , we can use both and to represent members of . However, the resulting permutation matrices, can be interpreted as different actions: while the direct sum matrix is a permutation of objects, the Kronecker product matrix is a permutation of objects.
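The difference between the two constructions can be checked numerically. The sketch below is only an illustration: the helper names `permutation_matrix`, `direct_sum` and `is_permutation_matrix` and the two example permutations are assumptions of this example, while `np.kron` is NumPy's Kronecker product.

```python
import numpy as np

def permutation_matrix(perm):
    """Matrix P with P[i, perm[i]] = 1, so that (P @ x)[i] = x[perm[i]]."""
    P = np.zeros((len(perm), len(perm)))
    P[np.arange(len(perm)), perm] = 1.0
    return P

def direct_sum(A, B):
    """Block-diagonal matrix with A in the top-left and B in the bottom-right."""
    out = np.zeros((A.shape[0] + B.shape[0], A.shape[1] + B.shape[1]))
    out[:A.shape[0], :A.shape[1]] = A
    out[A.shape[0]:, A.shape[1]:] = B
    return out

def is_permutation_matrix(M):
    """True if M is a 0/1 matrix with exactly one 1 in every row and column."""
    return bool(np.all((M == 0) | (M == 1))
                and np.all(M.sum(axis=0) == 1)
                and np.all(M.sum(axis=1) == 1))

G = permutation_matrix([1, 0, 2])  # a permutation of 3 objects
H = permutation_matrix([1, 0])     # a permutation of 2 objects

S = direct_sum(G, H)  # permutes 3 + 2 objects
K = np.kron(G, H)     # permutes 3 * 2 objects
```

Both `S` and `K` are permutation matrices, but `S` acts on two separate lists of objects stacked on top of each other, while `K` acts on all pairs of objects, which is the action needed for the cells of a two-dimensional tensor.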


Claim 1.

Consider the vectorized relational data of length . The action of on so that the content is preserved, is given by the following permutation group


where the order of relations in is consistent with the ordering used for vectorization of .


Proof. The Kronecker product when applied to , permutes the underlying tensor along the axes . Using direct sum, these permutations are applied to each tensor in . The only constraint, enforced by Eq. 5 is to use the same permutation matrix for all when . Therefore any matrix-vector product is a “legal” permutation of , since it only shuffles the instances of each entity. ∎
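The argument above can be checked numerically on two tensors that share an entity. This is a sketch under stated assumptions: it uses row-major vectorization (NumPy's default `reshape` order, which may differ from the ordering used in the paper), random data, and hypothetical names such as `X_takes` and `X_prereq` for a student-course and a course-course tensor.

```python
import numpy as np

rng = np.random.default_rng(0)

def permutation_matrix(perm):
    """Matrix P with P[i, perm[i]] = 1, so that (P @ x)[i] = x[perm[i]]."""
    P = np.zeros((len(perm), len(perm)))
    P[np.arange(len(perm)), perm] = 1.0
    return P

n_students, n_courses = 3, 4
X_takes = rng.normal(size=(n_students, n_courses))   # student-course tensor
X_prereq = rng.normal(size=(n_courses, n_courses))   # course-course tensor

P_s = permutation_matrix(rng.permutation(n_students))
P_c = permutation_matrix(rng.permutation(n_courses))

# Kronecker product acting on the row-major vectorization ...
lhs_takes = np.kron(P_s, P_c) @ X_takes.reshape(-1)
# ... equals permuting the tensor along each axis and then vectorizing.
rhs_takes = (P_s @ X_takes @ P_c.T).reshape(-1)

# The course permutation is reused on both axes of the self-relation.
lhs_prereq = np.kron(P_c, P_c) @ X_prereq.reshape(-1)
rhs_prereq = (P_c @ X_prereq @ P_c.T).reshape(-1)
```

Stacking `lhs_takes` and `lhs_prereq` corresponds to applying the direct sum of the two Kronecker products to the concatenated vector, with the same course permutation appearing in both blocks.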

4 Equivariant Relational Layer (ERL)

Our objective is to design a neural layer —where are the number of input and output channels, and — such that any “legal” transformation of input –as defined in Eq. 5– should result in the same transformation of the output. For clarity, we limit the following definition to the case where , and extend it to multiple channels in Section 4.3.


Definition 1 (Equivariant Relational Layer; ERL).

Let be any permutation of . A fully connected layer with is called an Equivariant Relational Layer if,


That is, an ERL is a layer that commutes with the permutation if and only if is a legal permutation, as defined by Eq. 5.

We now propose a procedure to tie the entries of so as to guarantee the conditions of Definition 1. Moreover, we show that the proposed model is the most expressive form of parameter-sharing with this property.

We build up block-wise, with blocks corresponding to pairs of relations :


Here, the parameters are tied only within each block. To concisely express this complex tying scheme, we use the following indexing notation. In its most general form we can allow for particular form of bias parameters in Definition 1; see Appendix A for parameter-tying in the bias.

tuple or column vector (bold lower-case) a tuple tensor, inc. matrix (bold upper-case) set (or multiset) group (calligraphic) set of entities number of instances a set of relations a relation data for a relation relational data vectorization of length of length of parameter matrix (i,j) block of index for and for rows of index for and rows/columns of symmetric group group of “legal” permutations of

Table 1: Summary of Notation

Indexing Notation.

The parameter block is an matrix, where . Given the relation , we use the tuple to index an element in the set . Therefore, can be used as an index for both data block and the rows of parameter block . In particular, to denote an entry of , we use . Moreover, we use to denote the element of corresponding to entity . Note that this is not necessarily the element of the tuple . For example, if and , then and . When is a multiset, we can use to refer to the element of corresponding to the -th occurrence of entity (where the order corresponds to the ordering of elements in ).

Figure 2: (a) The parameter matrix from Example 1. Each colour represents a unique parameter value. The nine blocks, showing the interaction of three relations, are clearly visible. The arrows indicate the permutation that is being applied. (b) The result of applying a permutation to the rows of . The permutation is permuting the “instances” of student, course and professor. So this corresponds to a “legal” permutation as defined by Eq. 5. We can see that this swapping is applied block-wise to blocks of rows of corresponding to each for which . In the case of , must also be applied to rows within blocks, as these correspond to entity 2 as well. (c) The result of applying the inverse permutation (in this case, the same permutation) to the columns of . Here, we swap columns block-wise in each for which . By doing so we recover the original matrix, as per Lemma 1.

4.1 Parameter Tying

Let and denote two arbitrary elements of the parameter matrix . Our objective is to decide whether or not they should be tied together. For this we define an equivalence relation between index tuples , where is a concatenation of and . All the entries of with “equivalent” indices are tied together. This equivalence relation is based on the equality patterns within and . To study the equality pattern in , we should consider the equality pattern in each sub-tuple , the restriction of to only indices over entity , for each . This is because we can only meaningfully compare indices of the same entity – e.g., we can compare two student indices for equality, but we cannot compare a student index with a course index. We can further partition , into sub-partitions such that two of its indices are in the same partition iff they are equal:


Using this notation, we say iff they have the same equality patterns —i.e., the same partitioning of indices for all :


This means that the total number of free parameters in is the product of the number of possible different partitionings for each entity :


where is the free parameter vector associated with , and is the Bell number that counts the possible partitionings of a set of size ; growth of number of parameters with Bell number was previously shown for equivariant graph networks (Maron et al., 2018), which as we see in Section 6 are closely related, and indeed a special case of our model.
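The parameter count described above (a product, over entities, of the Bell number of how many indices of that entity appear in the concatenated row and column index) can be computed directly. The sketch below is illustrative: `bell` and `num_free_parameters` are names introduced here, and relations are written as tuples of entity labels, with 1 standing for student and 2 for course as in the running example.

```python
from math import comb

def bell(k):
    """Bell number B_k via the recurrence B_{n+1} = sum_i C(n, i) * B_i."""
    values = [1]  # B_0
    for n in range(k):
        values.append(sum(comb(n, i) * values[i] for i in range(n + 1)))
    return values[k]

def num_free_parameters(rel_i, rel_j):
    """Free parameters of the block pairing rel_i (rows) with rel_j (columns)."""
    combined = list(rel_i) + list(rel_j)
    count = 1
    for entity in set(combined):
        # Number of ways to partition the indices that refer to this entity.
        count *= bell(combined.count(entity))
    return count

takes = (1, 2)         # student-course
prerequisite = (2, 2)  # course-course (self-relation)

n_takes_takes = num_free_parameters(takes, takes)
n_takes_prereq = num_free_parameters(takes, prerequisite)
n_prereq_prereq = num_free_parameters(prerequisite, prerequisite)
```

Under these assumptions the three blocks have 4, 5 and 15 free parameters respectively, independent of the number of instances of each entity.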

Example 1.

[Fig. 2] To get an intuition for this tying scheme, consider a simplified version of Fig. 1 restricted to three relations , self-relation , and with students, courses, and professors. Then , so and will have nine blocks: and so on.

We use tuple to index the rows and columns of . We also use to index the rows of , and use to index its columns. Other blocks are indexed similarly.

The elements of take different values, depending on whether or not and , for row index and column index . The elements of take different values: The index can only be partitioned in a single way (). However index and indices and all index into the courses table, and so can each potentially refer to the same course. We thus have a unique parameter for each possible combination of equalities between these three items, giving us a factor of different parameter values; see Fig. 2(a), is the upper left block, and is the block to its right.

The center block of Fig. 2(a), produces the effect of on itself. Here, all four index values could refer to the same course, and so there are different parameters.

This parameter-sharing scheme admits a simple recursive form, if the database has no self-relations; see Appendix B.

4.1.1 Achieving ERL with parameter tying

In this section we relate the notion of group-action that led to the definition of an ideal Equivariant Relational Layer (Definition 1), to the parameter-sharing scheme that produced . Starting with the following lemma, we show that our parameter-sharing produces ERL, and any relaxation of it (by untying the parameters) fails the ERL requirements.

Lemma 1.

For any permutation matrices and we have

for any choice of . That is and separately permute the instances of each entity in the multisets and , applying the same permutation to any duplicated entities, as well as to any entities common to both and .

For intuition and the proof of this lemma see Appendix C. See Fig. 2 for a minimal example, demonstrating this lemma. We are now prepared to state our two main Theorems.


Theorem 4.1.

Let be the tensor representation of some relational data, its vectorized form. If we define block-wise according to Eq. 7, with blocks given by the tying scheme of Section 4.1, then the layer is an Equivariant Relational Layer (Definition 1).

See Appendix C for a proof. This theorem assures us that a layer constructed with our parameter-sharing scheme achieves equivariance w.r.t. the exchangeabilities of the relational data. However, one may wonder whether an alternative, more expressive parameter-sharing scheme may be possible. The following theorem proves that this is not the case, and that our model is the most expressive parameter-tying scheme possible for an ERL.
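A small numerical check can make the statement of Theorem 4.1 concrete. The following is a minimal sketch, not the authors' implementation: it ties entries of each block by the per-entity equality pattern of the concatenated row and column index, fills each pattern with a random value, and tests whether the resulting matrix commutes with a legal permutation. The choice of relations (the third one in particular), the instance counts, row-major index ordering, and all function names are assumptions of this example.

```python
import itertools
import numpy as np

def equality_pattern(entities, values):
    """Label each position by the first position holding the same (entity, value)."""
    first_seen = {}
    return tuple(first_seen.setdefault((e, v), len(first_seen))
                 for e, v in zip(entities, values))

def index_tuples(relation, sizes):
    """All index tuples of a relation, in row-major order."""
    return list(itertools.product(*(range(sizes[e]) for e in relation)))

def build_tied_matrix(relations, sizes, rng):
    """Block matrix whose entries are tied within each block by equality pattern."""
    blocks = []
    for rel_i in relations:
        block_row = []
        for rel_j in relations:
            params = {}  # one random parameter per equality pattern in this block
            rows, cols = index_tuples(rel_i, sizes), index_tuples(rel_j, sizes)
            block = np.empty((len(rows), len(cols)))
            for a, n in enumerate(rows):
                for b, m in enumerate(cols):
                    pattern = equality_pattern(rel_i + rel_j, n + m)
                    if pattern not in params:
                        params[pattern] = rng.normal()
                    block[a, b] = params[pattern]
            block_row.append(block)
        blocks.append(block_row)
    return np.block(blocks)

def permutation_matrix(perm):
    P = np.zeros((len(perm), len(perm)))
    P[np.arange(len(perm)), perm] = 1.0
    return P

def legal_permutation(relations, entity_perms):
    """Direct sum over relations of Kronecker products of per-entity permutations."""
    mats = []
    for rel in relations:
        K = np.ones((1, 1))
        for e in rel:
            K = np.kron(K, entity_perms[e])
        mats.append(K)
    total = sum(M.shape[0] for M in mats)
    out, offset = np.zeros((total, total)), 0
    for M in mats:
        s = M.shape[0]
        out[offset:offset + s, offset:offset + s] = M
        offset += s
    return out

rng = np.random.default_rng(0)
relations = [(1, 2), (2, 2), (2, 3)]  # hypothetical choice of three relations
sizes = {1: 3, 2: 4, 3: 2}            # assumed numbers of instances per entity
W = build_tied_matrix(relations, sizes, rng)

entity_perms = {e: permutation_matrix(rng.permutation(n)) for e, n in sizes.items()}
P = legal_permutation(relations, entity_perms)
commutes_with_legal = np.allclose(W @ P, P @ W)

# Swapping two cells of the first tensor is not induced by any entity shuffle.
swap = np.eye(W.shape[0])
swap[[0, 1]] = swap[[1, 0]]
commutes_with_swap = np.allclose(W @ swap, swap @ W)
```

In this sketch the tied matrix commutes with the legal permutation but not with the cell swap, in line with Definition 1; the dense double loop is for clarity only and would not scale to realistic databases.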

Theorem 4.2.

Let be the tensor representation of some relational data. If is any parameter matrix other than the one defined in Section 4.1, then the neural network layer is NOT an Equivariant Relational Layer (Definition 1).

4.2 Sparse Tensors

So far, for simplicity, we assumed that the tensors are dense. In practice, we often observe only a small portion of the entries of these tensors. Sparsity does not affect the equivariance properties of the proposed parameter-sharing. In practice, not only do we use a sparse representation of the input tensors, but we also produce sparse tensors as the output of the ERL, without compromising its desirable equivariance properties. Using sparse input and output, together with pooling operations that reduce the complexity of to “linear” in the number of non-zeros in , makes ERL relatively practical; see Appendix B for the pooling-based implementation. Note, however, that the layer must still receive the whole database as its input, and further subsampling techniques are required to apply this model to larger real-world databases.

(a) Transductive
(b) Inductive
Figure 3: Ground truth versus predicted embedding for course instances in both transductive (a) and inductive (b) setting. In the inductive setting, training and test databases contain completely distinct instances. The x-y location encodes the prediction and the size and color of each dot encodes the ground-truth .

4.3 Multiple Layers and Channels

Equivariance is maintained by composition of equivariant functions. This allows us to stack ERL to build “deep” models that operate on relational data(bases). Using multiple input () and output () channels is also possible by replacing the parameter matrix , with the parameter tensor ; while copies have the same parameter-tying pattern —i.e., there is no parameter-sharing “across” channels. The single-channel matrix-vector product in where is now replaced with contraction of two tensors , for .
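The multi-channel contraction can be sketched with `np.einsum`. To keep the example short it uses the simplest tying pattern (a single relation over a single entity, where each channel pair has one diagonal and one off-diagonal parameter) rather than the full relational scheme; the axis layout of the parameter tensor, the sizes, and the name `layer` are assumptions of this example.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K_in, K_out = 5, 3, 2  # assumed number of elements and channels

# Independent parameters for every (output, input) channel pair; the tying
# pattern across the N x N positions is the same for all pairs.
diag = rng.normal(size=(K_out, K_in))
off = rng.normal(size=(K_out, K_in))
W = (diag[:, :, None, None] * np.eye(N)
     + off[:, :, None, None] * np.ones((N, N)))  # shape (K_out, K_in, N, N)

def layer(X):
    """Contract input channels and positions: Y[o, n] = sum_{i, m} W[o, i, n, m] X[i, m]."""
    return np.einsum("oinm,im->on", W, X)

X = rng.normal(size=(K_in, N))
perm = rng.permutation(N)

permute_then_apply = layer(X[:, perm])
apply_then_permute = layer(X)[:, perm]
```

Because every channel pair uses an equivariant matrix, the contraction as a whole is equivariant, and stacking several such layers (with pointwise nonlinearities in between) preserves this property.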

5 Experiments

Our experiments study the viability of ERLs for deep embedding, prediction of missing records, and inductive reasoning. See Appendix D for details as well as an additional real-world experiment where we predict the outcome of a soccer match from historical records and player information.


To continue with our running example we synthesize a toy dataset, restricted to (student-course), (student-professor), and (professor-course). Each matrix in the relational database, , is produced by first uniformly sampling an h-dimensional embedding for each entity instance , followed by matrix product . A sparse subset of these matrices are observed by our model in the following experiments. Note that while we use a simple matrix product to generate the content of tables from latent factors, the model is oblivious to this generative process.
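The generative process described above can be sketched as follows. The instance counts, the sampling interval of the uniform distribution, the observed fraction, and the variable names are all assumptions of this example, since the source does not preserve the exact values.

```python
import numpy as np

rng = np.random.default_rng(0)
h = 2  # embedding dimension (hypothetical value)
counts = {"student": 30, "course": 20, "professor": 10}  # hypothetical sizes

# One latent embedding per entity instance, sampled uniformly (interval assumed).
Z = {name: rng.uniform(-1.0, 1.0, size=(n, h)) for name, n in counts.items()}

# Each table is the matrix product of the embeddings of its two entities.
pairs = [("student", "course"), ("student", "professor"), ("professor", "course")]
tables = {pair: Z[pair[0]] @ Z[pair[1]].T for pair in pairs}

# Only a sparse subset of the entries is shown to the model.
observed_fraction = 0.1  # assumed
masks = {pair: rng.random(tables[pair].shape) < observed_fraction for pair in pairs}
observed = {pair: np.where(masks[pair], tables[pair], np.nan) for pair in pairs}
```

Because the three tables share latent factors, observations in one table carry information about missing entries in the others, which is the property the experiments on side information rely on.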


We use a factorized auto-encoding architecture consisting of a stack of ERLs followed by pooling that produces code matrices for each entity, student, course and professor. The code is then fed to a decoding stack of ERLs to reconstruct the sparse .

(a) Transductive
(b) Inductive
Figure 4: (a) Average mean squared error in predicting missing records in student-course as a function of sparsity level of the whole database , during training (x-axis) and test (y-axis), in the transductive setting. (b) The same test error in the inductive setting where the model is tested on a new database with unseen students, courses and professors. The baseline is predicting the mean value of training observations. At test time, the observed entries are used to predict the values of the fixed, held-out test set.

We use a small embedding dimension for visualization. We use , and the model observes of database entries. Fig. 3(a) visualizes the ground-truth versus the estimated embedding for students; see Appendix D for more figures. The figure suggests that the produced embedding agrees with the ground truth (note that in the best case, the two embeddings agree up to a diffeomorphism).

Missing Record Prediction.

Here, we set out to predict missing values in the student-course table using the observations across the whole database. For this, the factorized auto-encoding architecture is trained to minimize the reconstruction error only for the “observed” entries of the student-course tensor.

We use , with an embedding size of 10 and continuously change two quantities: 1) the percentage of observed entries (i.e., sparsity of all database tensors) at training time; and 2) once the model is trained, we vary the sparsity of test time observations. Figure 4(a) visualizes the average prediction error over 5 runs as a function of these two quantities for the student-course table. The figure confirms our expectation that the amount of observations during both training and test can increase the prediction accuracy. More importantly, once the model is trained at a particular sparsity level, it shows robustness to the changes in the sparsity level during test time.

Predictive Value of Side-Information.

Our previous experiments raise the question of whether we gain anything by using the “entire” database to predict missing entries of a particular table. After all, we could simply make predictions using only the target tensor (the student-course table) for both training and testing. To answer this question, we fix the sparsity level of the student-course table at 0.1, and train models with increasing levels of sparsity for the side information. That is, we vary the sparsity of the tensors and in the range . Table 2 shows that our holistic approach to predictive analysis is indeed advantageous: side information in the form of student-professor and course-professor tables can improve the prediction of missing records in the student-course table.

Inductive Setting.

Once we finish training the model on a relational dataset, we can apply it to another instantiation, that is, a dataset with completely different students, courses and professors. Fig. 3(b) reproduces the embedding experiment, and Fig. 4(b) shows the missing record prediction results for the inductive setting. In both cases the model performs reasonably well when applied to a new database. This setting can have interesting real-world applications, as it enables transfer learning across databases and allows for predictive analysis without training for new entities in a database as they become available.

6 Related Literature

Here, we briefly review the related literature in different areas. To our knowledge there are no similar frameworks for direct application of deep models to relational databases, and current practice is to automate feature-engineering for specific prediction tasks (Lam et al., 2018).

% observed .025 .05 .16 .33 .69
Error (RMSE)

Table 2: Loss on the student-course table as we vary the number of observations in the student-prof and course-prof tables. Adding side info in the form of additional tables clearly improves the predictions made in the main table.

Statistical Relational Learning.

Statistical relational learning extends the reach of probabilistic inference to the relational model (Raedt et al., 2016). For example, a variety of lifted inference procedures extend inference methods in graphical models to the relational setting, where in some cases the symmetry group of the model is used to speed up inference (Kersting, 2012). Another relevant direction explored in this community includes extensions of logistic regression to relational data (Kazemi et al., 2014; Kazemi & Poole, 2017).

Knowledge-Graph Embedding.

An alternative to inference with symbolic representations of relational data is to use embeddings. In particular, tensor factorization methods, which offer tractable inference in latent variable graphical models (Anandkumar et al., 2014), are extensively used for knowledge-graph embedding (Nickel et al., 2016). A knowledge-graph can be expressed as an ER diagram with a single relation , where representing head and tail entities and is an entity representing the relation. Alternatively, one could think of a knowledge-graph as a graph representation for an instantiated ER diagram (as opposed to a set of tables or tensors). However, in knowledge-graphs, an entity-type is a second-class citizen, as it is either another attribute, or it is expressed through relations to special objects representing different “types”.

Relational and Geometric Deep Learning.

Here, we provide a brief overview; see Hamilton et al. (2017b) and Battaglia et al. (2018) for a detailed review. Scarselli et al. (2009) introduced a generic framework that iteratively updates node embeddings using neural networks; see also (Li et al., 2015). Gilmer et al. (2017) proposed a similar iterative procedure that updates node embeddings and messages between the neighbouring nodes, and showed that it subsumes several other deep models for attributed graphs (Duvenaud et al., 2015; Schütt et al., 2017; Li et al., 2015; Battaglia et al., 2016; Kearnes et al., 2016), including spectral methods that we discuss next. Their method is further generalized in (Kondor et al., 2018b) as well as (Maron et al., 2018), which is in turn subsumed in our framework. Spectral methods extend convolution to graphs (and manifolds) using eigenvectors of the Laplacian as the generalization of the Fourier basis (Bronstein et al., 2017; Bruna et al., 2014). Simplified variations of this approach lead to an intuitive yet non-maximal parameter-sharing scheme that is widely used in practice (Defferrard et al., 2016; Kipf & Welling, 2016). This type of simplified graph convolution has also been used for relational reasoning with knowledge-graphs (Schlichtkrull et al., 2018).

Equivariant Deep Models.

An alternative generalization of convolution is defined for functions over groups (Olah, 2014; Cohen & Welling, 2016a), or more generally homogeneous spaces (Cohen et al., 2018b). Moreover, convolution can be performed in the Fourier domain in this setting, where irreducible representations of a group become the Fourier bases (Kondor & Trivedi, 2018). Equivariant deep model design for a variety of structured domains is explored in several other recent works (e.g., Worrall et al., 2017; Cohen et al., 2018a; Kondor et al., 2018a; Sabour et al., 2017; Weiler et al., 2018); see also (Cohen & Welling, 2016b; Weiler et al., 2017; Kondor et al., 2018b; Anselmi et al., 2019).

Parameter-Sharing, Exchangeability and Equivariance.

The notion of invariance is also studied under the term exchangeability in statistics (Orbanz & Roy, 2015); see also (Benjamin & Whye Teh, 2019) for a probabilistic approach to equivariance. In graphical models exchangeability is often encoded through plate notation, where parameter-sharing happens implicitly.

In the AI community, this relationship between the parameter sharing and “invariance” properties of the network was noticed in the early days of the Perceptron (Minsky & Papert, 2017; Shawe-Taylor, 1989, 1993). This was rediscovered in (Ravanbakhsh et al., 2017), where this relation was leveraged for equivariant model design. They show that when the group action is discrete (i.e., is in the form of a permutation) “equivariance” to any group action can be obtained by parameter-sharing. In particular, one of the procedures discussed there ties the elements of based on the orbits of the “joint” action of the group on the rows and columns of . The parameter-tying scheme in the following equivariant networks can be obtained in this way:

I. Zaheer et al. (2017) propose an equivariant model for set data. Our model reduces to their parameter-tying when we have a single relation with a single entity – i.e.; i.e., a set of instances; see also Example (2) in Appendix B.

II. Hartford et al. (2018) consider a more general setting of interaction across different sets, such as user-tag-movie relations. Our model produces their parameter-sharing when we have a single relation with multiple entities , where all entities appear only once – i.e..

III. Maron et al. (2018) further relax the assumption above, and allow for . Intuitively, this form of relational data can model the interactions within and between sets; for example interaction within nodes of a graph is captured by an adjacency matrix, corresponding to and . This type of parameter-tying is maximal for graphs, and subsumes the parameter-tying approaches derived by simplification of Laplacian-based methods. When restricted to a single relation, our model reduces to the model of (Maron et al., 2018); however, when we have multiple relations, for , our model captures the interaction between different relations / tensors.

7 Discussion and Future Work

We have outlined a novel and principled approach to deep learning with relational data(bases). In particular, we introduced a simple constraint in the form of tied parameters for the standard neural layer and proved that any other tying scheme is either not equivariant with respect to the exchangeabilities of relational data or can be obtained by further constraining the parameters of our model. The proposed model can be applied in the inductive setting, where the relational data(bases) used during training and testing have no overlap.

While our model enjoys linear computational complexity, one hurdle remains before it can be applied to large-scale real-world databases, where one must work with sub-samples of the data: any such sampling has the effect of sparsifying the observed relations. A careful sampling procedure is required that minimizes this sparsification for a particular subset of entities or relations. While several recent works propose solutions to similar problems on graphs and tensors (e.g., Hamilton et al., 2017a; Hartford et al., 2018; Ying et al., 2018; Eksombatchai et al., 2017; Chen et al., 2018; Huang et al., 2018), we leave this important direction for relational databases to future work.


Appendix A Bias Parameters

For full generality, our definition of ERL could also include bias terms without affecting its exchangeability properties. We exclude these in the statements of our main theorems for the sake of simplicity, but discuss their inclusion here for completeness. For each relation , we define a bias tensor . The elements of are tied together in a manner similar to the tying of elements in each : Two elements and are tied together iff , using the definition of equivalence from Eq. 9. Thus, we have a vector of additional free parameters for each relation , where


Consistent with our previous notation, we define , and . Then an ERL with bias terms is given by


The following Claim asserts that we can add this bias term without affecting the desired properties of the ERL.

Claim 2.

If is an ERL, then is an ERL.

The proof (found in Section C.1) argues that, since is an ERL, we just need to show that iff , which holds due to the tying of patterns in each .

Appendix B Simplifications for Models without Self-Relations

In the special case that the multi relations and are sets (i.e., have no self-relations), the parameter-tying scheme of Section 4.1 can be simplified considerably. In this section we address some useful properties of this special setting.

B.0.1 Efficient Implementation Using Subset-Pooling

Due to the particular structure of when all relations contain only unique entities, the operation in the ERL can be implemented using (sum/mean) pooling operations over the tensors for , without any need for vectorization, or for storing directly.

For and , let be the summation of the tensor over the dimensions specified by . That is, where . Then we can write element in the -th block of as


where is the restriction of to only elements indexing entities in . This formulation lends itself to a practical, efficient implementation where we simply compute each term and broadcast-add them back into a tensor of appropriate dimensions.
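The pool-and-broadcast idea can be sketched for the simplest case: a single relation over two distinct entities (for example, student-course), restricted to the block that maps the relation to itself. This is a hedged illustration rather than the authors' code. The names `pool` and `pooled_layer`, the dense list-of-lists input, the choice of sum pooling, and the weight values are all assumptions; a practical version would use sparse tensors and an array library.

```python
def pool(X, dims):
    """Sum the 2-D list X over the dimensions in `dims` (a subset of (0, 1)).
    The result is returned as a function of the full index (n, m), so it is
    implicitly broadcast back to the shape of X."""
    rows, cols = len(X), len(X[0])
    if dims == ():
        return lambda n, m: X[n][m]          # no pooling: the entry itself
    if dims == (0,):
        col_sums = [sum(X[n][m] for n in range(rows)) for m in range(cols)]
        return lambda n, m: col_sums[m]      # pooled over dimension 0
    if dims == (1,):
        row_sums = [sum(X[n]) for n in range(rows)]
        return lambda n, m: row_sums[n]      # pooled over dimension 1
    total = sum(map(sum, X))
    return lambda n, m: total                # pooled over both dimensions

def pooled_layer(X, weights):
    """Compute one weighted term per subset of dimensions and add the
    broadcast terms into an output of the same shape as X."""
    rows, cols = len(X), len(X[0])
    Y = [[0.0] * cols for _ in range(rows)]
    for dims, w in weights.items():
        term = pool(X, dims)
        for n in range(rows):
            for m in range(cols):
                Y[n][m] += w * term(n, m)
    return Y

# Hypothetical parameters: one scalar per subset of pooled dimensions.
weights = {(): 1, (0,): 2, (1,): 3, (0, 1): 4}
X = [[1, 2, 3],
     [4, 5, 6]]
Y = pooled_layer(X, weights)
```

Because each term depends on the input only through sums over whole dimensions, permuting the rows or columns of `X` permutes the rows or columns of `Y` in the same way, and the full weight matrix never has to be stored.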

B.0.2 One-to-One and One-to-Many Relations

In the special case of a one-to-one or one-to-many relation (e.g., in Fig. 1, one professor may teach many courses, but each course has only one professor), we may further reduce the number of parameters due to redundancies. Suppose is some relation, and entity is in a one-to- relation with the remaining entities of . Consider the 1D sub-array of obtained by varying the value of while holding the remaining values fixed. This sub-array contains just a single non-zero entry. According to the tying scheme described in Section 4.1, the parameter block will contain unique parameter values and . Intuitively, however, these two parameters capture exactly the same information, since the sub-array obtained by fixing the values of contains exactly the same data as the sub-array obtained by fixing the values of (i.e., the same single value). More concretely, to use the notation of Section B.0.1, we have in Eq. 13, and so we may tie and .
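One way to see this redundancy is with a small sparse example. The sketch below assumes a hypothetical `teaches` table in which each course has exactly one professor, and it assumes the layer is only evaluated on observed records; the table contents and the helper name `pool_over_professor` are illustrative and not taken from the paper.

```python
# Hypothetical sparse "teaches" table: (professor, course) -> attribute value.
# Each course appears exactly once, so every course has a single professor.
teaches = {(0, 0): 2.5, (0, 1): 1.0, (1, 2): 4.0, (2, 3): 0.5}

def pool_over_professor(table):
    """Sum the relation over the professor dimension, one total per course."""
    totals = {}
    for (prof, course), value in table.items():
        totals[course] = totals.get(course, 0.0) + value
    return totals

pooled = pool_over_professor(teaches)

# On every observed record, the pooled term equals the record's own value,
# so a parameter on the record and a parameter on the pooled term only
# ever act through their sum.
redundant = all(pooled[course] == value
                for (_, course), value in teaches.items())
print(redundant)
```

Under these assumptions the two terms carry the same single value at each observed record, which is why the corresponding pair of parameters can be tied.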

In fact, a similar argument lets us reduce the number of free parameters in the case of self-relations (i.e., relations with non-unique entities) as well.

Appendix C Proofs

Observe that for any index tuple , we can express (Section 4.1) as


C.1 Proof of Claim 2


We want to show that


iff . Since is an ERL, this is equivalent to showing


() Suppose , with defined as in creftype 1. Fix some relation and consider the -th block of :