In the relational model of data, we have a set of entities, and one or more instances of each entity. These instances interact with each other through a set of fixed relations between entities. A set of attributes may be associated with each type of entity and relation. (An alternative terminology refers to instances and entities as entities and entity types.) This simple idea is widely used to represent data, often in the form of a relational database, across a variety of domains, from shopping records, social networking data and health records, to heterogeneous data from astronomical surveys.
Learning and inference on relational data has been a topic of research in machine learning over the past decades. The relational model is closely related to first-order and predicate logic, where the existence of a relation between instances becomes a truth statement about a world. The same formalism is used in AI through a probabilistic approach to logic, where the field of statistical relational learning has fused the relational model with the framework of probabilistic graphical models. Examples of such models include plate models, probabilistic relational models, Markov logic networks and relational dependency networks (Getoor & Taskar, 2007).
A closely related area that has enjoyed accelerated growth in recent years is relational and geometric deep learning, where the term “relational” is used to denote the inductive bias introduced by a graph structure (Battaglia et al., 2018). Although relational and graph-based terms are often used interchangeably within the machine learning community, they can refer to different data structures: graph-based models, such as graph databases (Robinson et al., 2013) and knowledge graphs (Nickel et al., 2016), simply represent the data as an attributed (hyper-)graph, while the relational model, extensively used in databases, uses the entity-relation (ER) diagram (Chen, 1976) to constrain the relations of each instance (corresponding to a node), based on its entity-type; see Fig. 1(a) and the related work in Section 6.
Here, we use an alternative inductive bias, namely invariance and equivariance, to encode the structure of relational data. This type of bias informs a model’s behaviour under various transformations, and it is built on group theory rather than graph theory. Equivariant models have been successfully used for deep learning with a variety of topologically distinct structures, from images and graphs to sets and spheres. By adopting this perspective, we present a maximally expressive neural layer that achieves equivariance w.r.t. exchangeabilities in relational data. Our model generalizes recently proposed models for sets (Zaheer et al., 2017), exchangeable tensors (Hartford et al., 2018), as well as equivariant models for attributed hyper-graphs (Maron et al., 2018).
2 Representations of Relational Data
To formalize the relational model, we have a set of entities, and a set of instances of each entity . These entities interact with each other through a set of relations . For each relation , where , we observe data in the form of a set of tuples . Note that the singleton relation can be used to incorporate individual entity attributes (such as professors’ evaluations in Fig. 1-a). For a full list of notation used in this paper, see Table 1.
In the most general case, we allow for both , and any to be multisets (i.e., to contain duplicate entries). is a multiset if we have multiple relations between the same set of entities. For example, we may have a supervises relation between students and professors, in addition to the writes reference relation. A particular relation is a multiset if it contains multiple copies of the same entity. Such relations are ubiquitous in the real world, describing for example, the connection graph of a social network, the sale/purchase relationships between a group of companies, or, in our running example, the course-course relation capturing prerequisite information.
For our initial derivations we make the simplifying assumption that each attribute is a scalar. Later, we extend this to . In general, this attribute can be a complex object, such as text, image or video, encoded (decoded) to (from) a feature-vector using a deep model, trained end-to-end. Another common feature of relational data is the one-to-many relation, addressed in Appendix B.
2.1 Tuples, Tables and Tensors
In relational databases the set of tuples is often represented using a table, with one row for each tuple; see Fig. 1(b). An equivalent representation for is using a “sparse” -dimensional tensor , where each dimension of this tensor corresponds to an entity , and the length of that dimension is the number of instances . In other words
We work with this tensor representation of relational data. We use to denote the set of all sparse tensors that define our relational data(base); see Fig. 1(c). For the following discussions around exchangeability and equivariance, w.l.o.g. we assume that for all , are fully observed, dense tensors. Subsequently, we will discard this assumption and attempt to make predictions for (any subset of) the missing records.
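The table-to-tensor correspondence described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the tuple layout `(student index, course index, value)`, the instance counts, and the helper name `tuples_to_tensor` are all assumptions made for the example. A boolean mask records which entries were observed, so that a dense array can stand in for the sparse tensor.

```python
import numpy as np

# Hypothetical student-course table: (student index, course index, grade).
rows = [(0, 1, 3.5), (2, 0, 4.0), (1, 1, 2.0)]
num_students, num_courses = 3, 2  # assumed instance counts


def tuples_to_tensor(rows, shape):
    """Scatter tuples into a value tensor plus a mask of observed entries."""
    values = np.zeros(shape)
    observed = np.zeros(shape, dtype=bool)
    for *index, value in rows:
        values[tuple(index)] = value      # one tensor axis per entity in the relation
        observed[tuple(index)] = True
    return values, observed


X, mask = tuples_to_tensor(rows, (num_students, num_courses))
```

The same helper works for relations over more than two entities, since each tuple simply carries one index per tensor axis.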
Because we allow a relation to contain the same entities multiple times, we formally define a multiset as a tuple , where is a set, and maps elements of to their multiset counts. We will call the elements of the multiset , and the count of element . We define the union and intersection of two multisets and as
In general, we may also refer to a multiset using typical set notation (e.g., ). We will use bracketed superscripts to distinguish distinct but equal members of any multiset (e.g., ). The ordering of equal members is specified by context or arbitrarily. The size of a multiset accounts for multiplicities: .
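As a small illustration of the multiset conventions above, Python's `collections.Counter` stores exactly a set of elements together with their counts. The entity ids and the helper names `multiset_size` and `members` are hypothetical; the occurrence index in `members` plays the role of the bracketed superscripts.

```python
from collections import Counter

# Assumed entity ids: 1 = student, 2 = course.
course_course = Counter({2: 2})        # the course entity appears twice
mixed = Counter({1: 1, 2: 2})          # hypothetical relation with three slots


def multiset_size(ms):
    """Size of a multiset, counting multiplicities."""
    return sum(ms.values())


def members(ms):
    """List members as (element, occurrence) pairs to tell equal members apart."""
    return [(e, k) for e in sorted(ms) for k in range(1, ms[e] + 1)]
```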
3 Exchangeabilities of Relational Data(bases)
Recall that in the representation , each entity has a set of instances indexed by . However, this ordering is arbitrary and we can shuffle these instances, affecting only the representation, and not the “content” of our relational data. To maintain consistency across data tables, we must also shuffle all the tensors , where , using the same permutation applied to the tensor dimension corresponding to . At a high level, this simple indifference to shuffling defines the exchangeabilities of relational data. A mathematical group formalizes this idea.
A mathematical group is a set equipped with a binary operation between its members, such that the set and the operation satisfy closure, associativity, invertibility and existence of a unique identity element. refers to the symmetric group, the group of “all” permutations of objects. A natural representation for a member of this group is a permutation matrix . Here, the binary group operation is the same as the product of permutation matrices. In this notation, is the group of all permutations of instances of entity . To apply permutations to multiple dimensions of a data tensor we can use the direct product of groups. Given two groups and , the direct product is defined by
That is, the underlying set is the Cartesian product of the underlying sets of and , and the group operation is the component-wise operation.
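For reference, the standard textbook form of the direct product just described can be written as follows. The symbols $\mathcal{G}$, $\mathcal{H}$ and $\circ$ are notation assumed for this sketch and may differ from the symbols used elsewhere in the paper.

```latex
\mathcal{G} \times \mathcal{H}
  = \{\, (g, h) \mid g \in \mathcal{G},\ h \in \mathcal{H} \,\},
\qquad
(g, h) \circ (g', h') = (g \circ g',\ h \circ h').
```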
Observe that we can associate the group with a relational model with entities, where each entity has instances. Intuitively, applying permutations from this group to the corresponding relational data should not affect the underlying contents, while applying permutations from outside this group should. To see this, consider the tensor representation of Fig. 1(c): permuting students, courses or professors shuffles rows or columns of , but preserves its underlying content. However, arbitrary shuffling of the elements of this tensor could alter its content.
Our goal is to define a neural network layer that is “aware” of this structure. For this, we first need to formalize the action of on the vectorized form of .
For each tensor , refers to the total number of elements of tensor (note that for now we are assuming that the tensors are dense). We will refer to as the number of elements of . Then refers to the vectorization of , obtained by successively stacking its elements along its dimensions, where the order of dimensions is given by . We use to refer to the inverse operation of , so that . With a slight abuse of notation, we use to refer to , the vector created by stacking all of the ’s in a column according to a fixed ordering of the relations. The layer that we design later is also applied to this vectorized form of the relational data.
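The vectorization and its inverse can be sketched with NumPy reshapes. Row-major stacking is an assumption of this example (the text only requires some fixed order of dimensions), and the two tensors and the names `vec` and `unvec` are hypothetical.

```python
import numpy as np

X1 = np.arange(6.0).reshape(2, 3)    # hypothetical 2 x 3 relation tensor
X2 = np.arange(12.0).reshape(3, 4)   # hypothetical 3 x 4 relation tensor


def vec(X):
    """Stack the elements of X into a vector (row-major order, by assumption)."""
    return X.reshape(-1)


def unvec(x, shape):
    """Inverse of vec for a known tensor shape."""
    return x.reshape(shape)


# Vectorized database: all tensors stacked in a fixed order of the relations.
x = np.concatenate([vec(X1), vec(X2)])
```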
The action of on , permutes the elements of . Our objective is to define this group action by mapping to a group of permutations of objects – i.e., a subgroup of . To this end we need to use two types of matrix product.
Let and be two matrices. The direct sum is an block-diagonal matrix
and the Kronecker product is an matrix
Note that in the special case that both and are permutation matrices, and will also be permutation matrices.
Both of these matrix operations can represent the direct product of permutation groups. That is, given two permutation matrices , and , we can use both and to represent members of . However, the resulting permutation matrices can be interpreted as different actions: while the direct sum matrix is a permutation of objects, the Kronecker product matrix is a permutation of objects.
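A quick numerical check of the two matrix products may help here. In this sketch the helper names (`perm_matrix`, `direct_sum`, `is_permutation`) and the two example permutations are assumptions. It confirms that the direct sum of a 3 x 3 and a 2 x 2 permutation matrix permutes 3 + 2 objects, while their Kronecker product permutes 3 * 2 objects.

```python
import numpy as np


def perm_matrix(perm):
    """Permutation matrix P with (P @ x)[i] == x[perm[i]]."""
    return np.eye(len(perm), dtype=int)[list(perm)]


def direct_sum(A, B):
    """Block-diagonal matrix with A and B on the diagonal."""
    top = np.hstack([A, np.zeros((A.shape[0], B.shape[1]), dtype=A.dtype)])
    bottom = np.hstack([np.zeros((B.shape[0], A.shape[1]), dtype=B.dtype), B])
    return np.vstack([top, bottom])


def is_permutation(M):
    """True if M is a square 0/1 matrix with a single 1 in every row and column."""
    return bool(np.isin(M, (0, 1)).all()
                and (M.sum(axis=0) == 1).all()
                and (M.sum(axis=1) == 1).all())


P = perm_matrix([1, 2, 0])   # acts on 3 objects
Q = perm_matrix([1, 0])      # acts on 2 objects
S = direct_sum(P, Q)         # acts on 3 + 2 objects
K = np.kron(P, Q)            # acts on 3 * 2 objects
```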
Consider the vectorized relational data of length . The action of on so that the content is preserved, is given by the following permutation group
where the order of relations in is consistent with the ordering used for vectorization of .
The Kronecker product, when applied to , permutes the underlying tensor along the axes . Using direct sum, these permutations are applied to each tensor in . The only constraint, enforced by Eq. 5, is to use the same permutation matrix for all when . Therefore any matrix-vector product is a “legal” permutation of , since it only shuffles the instances of each entity. ∎
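The argument above can be checked numerically on a toy database. The sketch assumes two relations, student-course and student-professor, with made-up sizes and row-major vectorization. It builds a direct sum of Kronecker products in which the student permutation is shared by both blocks, and compares the result with permuting the axes of each tensor directly.

```python
import numpy as np

rng = np.random.default_rng(0)
N1, N2, N3 = 3, 4, 2                     # assumed numbers of students, courses, professors
X_sc = rng.normal(size=(N1, N2))         # student-course tensor
X_sp = rng.normal(size=(N1, N3))         # student-professor tensor


def perm_matrix(perm):
    """Permutation matrix P with (P @ x)[i] == x[perm[i]]."""
    return np.eye(len(perm))[np.asarray(perm)]


p1, p2, p3 = rng.permutation(N1), rng.permutation(N2), rng.permutation(N3)
P1, P2, P3 = perm_matrix(p1), perm_matrix(p2), perm_matrix(p3)

# One Kronecker product per relation; both blocks reuse the student permutation P1.
A = np.kron(P1, P2)
B = np.kron(P1, P3)
G = np.block([[A, np.zeros((A.shape[0], B.shape[1]))],
              [np.zeros((B.shape[0], A.shape[1])), B]])

x = np.concatenate([X_sc.reshape(-1), X_sp.reshape(-1)])
gx = G @ x

# Permuting the axes of each tensor directly gives the same vector.
expected = np.concatenate([X_sc[np.ix_(p1, p2)].reshape(-1),
                           X_sp[np.ix_(p1, p3)].reshape(-1)])
```

Because the same `p1` indexes the student axis of both tensors, the rows of the two tables stay aligned after the shuffle.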
4 Equivariant Relational Layer (ERL)
Our objective is to design a neural layer —where are the number of input and output channels, and — such that any “legal” transformation of input –as defined in Eq. 5– should result in the same transformation of the output. For clarity, we limit the following definition to the case where , and extend it to multiple channels in Section 4.3.
Definition 1 (Equivariant Relational Layer; ERL).
Let be any permutation of . A fully connected layer with is called an Equivariant Relational Layer if,
That is, an ERL is a layer that commutes with the permutation if and only if is a legal permutation, as defined by Eq. 5.
We now propose a procedure to tie the entries of so as to guarantee the conditions of Definition 1. Moreover, we show that the proposed model is the most expressive form of parameter-sharing with this property.
We build up block-wise, with blocks corresponding to pairs of relations :
Here, the parameters are tied only within each block.
To concisely express this complex tying scheme, we use the following indexing notation. In its most general form we can allow for a particular form of bias parameters in Definition 1; see Appendix A for parameter-tying in the bias.
The parameter block is an matrix, where . Given the relation , we use the tuple to index an element in the set . Therefore, can be used as an index for both data block and the rows of parameter block . In particular, to denote an entry of , we use . Moreover, we use to denote the element of corresponding to entity . Note that this is not necessarily the element of the tuple . For example, if and , then and . When is a multiset, we can use to refer to the element of corresponding to the -th occurrence of entity (where the order corresponds to the ordering of elements in ).
4.1 Parameter Tying
Let and denote two arbitrary elements of the parameter matrix . Our objective is to decide whether or not they should be tied together. For this we define an equivalence relation between index tuples , where is a concatenation of and . All the entries of with “equivalent” indices are tied together. This equivalence relation is based on the equality patterns within and . To study the equality pattern in , we should consider the equality pattern in each sub-tuple , the restriction of to only indices over entity , for each . This is because we can only meaningfully compare indices of the same entity – e.g., we can compare two student indices for equality, but we cannot compare a student index with a course index. We can further partition , into sub-partitions such that two of its indices are in the same partition iff they are equal:
Using this notation, we say iff they have the same equality patterns —i.e., the same partitioning of indices for all :
This means that the total number of free parameters in is the product of the number of possible different partitionings for each entity :
where is the free parameter vector associated with , and is the Bell number that counts the possible partitionings of a set of size ; growth of number of parameters with Bell number was previously shown for equivariant graph networks (Maron et al., 2018), which as we see in Section 6 are closely related, and indeed a special case of our model.
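The tying rule can be made concrete with a small NumPy sketch. It is an illustration under stated assumptions rather than the authors' code: relations are encoded as tuples of entity ids, the helper names `equality_pattern` and `tied_block` are hypothetical, and the example uses a single course-course self-relation with four courses. Entries of the block whose concatenated row and column indices share an equality pattern receive the same random parameter; the block is then tested against one legal and one illegal permutation.

```python
import itertools
import numpy as np


def equality_pattern(index, entities):
    """Canonical equality pattern of `index`, comparing only positions of the same entity."""
    pattern = []
    for d in sorted(set(entities)):
        labels = {}                      # index value -> label, in order of first appearance
        for pos, e in enumerate(entities):
            if e == d:
                pattern.append(labels.setdefault(index[pos], len(labels)))
    return tuple(pattern)


def tied_block(rel_out, rel_in, sizes, rng):
    """Weight block whose entries are tied whenever their index patterns agree."""
    entities = tuple(rel_out) + tuple(rel_in)
    out_idx = list(itertools.product(*[range(sizes[d]) for d in rel_out]))
    in_idx = list(itertools.product(*[range(sizes[d]) for d in rel_in]))
    params = {}
    W = np.empty((len(out_idx), len(in_idx)))
    for r, n in enumerate(out_idx):
        for c, m in enumerate(in_idx):
            key = equality_pattern(n + m, entities)
            W[r, c] = params.setdefault(key, rng.normal())
    return W, len(params)


def perm_matrix(perm):
    return np.eye(len(perm))[list(perm)]


COURSE = 2                                # assumed entity id
n_courses = 4
rng = np.random.default_rng(0)
W, n_params = tied_block((COURSE, COURSE), (COURSE, COURSE), {COURSE: n_courses}, rng)

P = perm_matrix([2, 0, 3, 1])
Q = perm_matrix([1, 0, 2, 3])
legal = np.kron(P, P)                     # same permutation on both course axes
illegal = np.kron(np.eye(n_courses), Q)   # permutes the second course axis only

commutes_legal = np.allclose(W @ legal, legal @ W)
commutes_illegal = np.allclose(W @ illegal, illegal @ W)
```

With four courses, all 15 equality patterns of four course indices occur, so `n_params` equals the Bell number of 4. The block commutes with the legal permutation and, for these randomly drawn parameters, does not commute with the illegal one.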
[Fig. 2] To get an intuition for this tying scheme, consider a simplified version of Fig. 1 restricted to three relations , self-relation , and with students, courses, and professors. Then , so and will have nine blocks: and so on.
We use tuple to index the rows and columns of . We also use to index the rows of , and use to index its columns. Other blocks are indexed similarly.
The elements of take different values, depending on whether or not and , for row index and column index . The elements of take different values: The index can only be partitioned in a single way (). However index and indices and all index into the courses table, and so can each potentially refer to the same course. We thus have a unique parameter for each possible combination of equalities between these three items, giving us a factor of different parameter values; see Fig. 2(a), is the upper left block, and is the block to its right.
The center block of Fig. 2(a), produces the effect of on itself. Here, all four index values could refer to the same course, and so there are different parameters.
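The parameter counts in this example follow from the product-of-Bell-numbers rule, which can be sketched as a short computation. The entity ids and the tuple encoding of relations are assumptions of this example. Under that encoding, the student-course block acting on itself has 2 * 2 = 4 free parameters, the student-course/course-course block has 1 * 5 = 5, and the course-course block acting on itself has 15.

```python
from collections import Counter
from math import prod


def bell(n):
    """Bell number B(n), computed with the Bell triangle."""
    row = [1]
    for _ in range(n):
        nxt = [row[-1]]
        for v in row:
            nxt.append(nxt[-1] + v)
        row = nxt
    return row[0]


def num_free_params(rel_out, rel_in):
    """Product over entities of the Bell number of that entity's total occurrences."""
    counts = Counter(rel_out) + Counter(rel_in)
    return prod(bell(k) for k in counts.values())


STUDENT, COURSE, PROF = 1, 2, 3           # assumed entity ids
student_course = (STUDENT, COURSE)
course_course = (COURSE, COURSE)          # self-relation, e.g. prerequisites
```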
This parameter-sharing scheme admits a simple recursive form, if the database has no self-relations; see Appendix B.
4.1.1 Achieving ERL with parameter tying
In this section we relate the notion of group-action that led to the definition of an ideal Equivariant Relational Layer (Definition 1), to the parameter-sharing scheme that produced . Starting with the following lemma, we show that our parameter-sharing produces ERL, and any relaxation of it (by untying the parameters) fails the ERL requirements.
For any permutation matrices and we have
for any choice of . That is and separately permute the instances of each entity in the multisets and , applying the same permutation to any duplicated entities, as well as to any entities common to both and .
For intuition and the proof of this lemma see Appendix C. See Fig. 2 for a minimal example, demonstrating this lemma. We are now prepared to state our two main Theorems.
See Appendix C for a proof. This theorem assures us that a layer constructed with our parameter-sharing scheme achieves equivariance w.r.t. the exchangeabilities of the relational data. However, one may wonder whether an alternative, more expressive parameter-sharing scheme may be possible. The following theorem proves that this is not the case, and that our model is the most expressive parameter-tying scheme possible for an ERL.
4.2 Sparse Tensors
So far, for simplicity, we assumed that the tensors are dense. In practice, we often observe only a small portion of the entries of these tensors. This sparsity does not affect the equivariance properties of the proposed parameter-sharing. In practice, not only do we use a sparse representation of the input tensors, but we also produce sparse tensors as the output of the ERL, without compromising its desirable equivariance properties. Using sparse input and output, together with pooling operations that reduce the complexity of to “linear” in the number of non-zeros in , makes ERL relatively practical; see Appendix B for a pooling-based implementation. Note, however, that the layer must still receive the whole database as its input, and further subsampling techniques are required to apply this model to larger real-world databases.
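To give a flavour of why pooling helps, the sketch below treats the simplest special case: one relation between two distinct entities, where the tied block has four free parameters. It is not the Appendix B implementation; the parameter names `a`, `b`, `c`, `d` and both helper functions are assumptions. The explicit tied matrix-vector product is compared with an equivalent computation that uses only row sums, column sums and the total sum, so the large weight matrix never needs to be formed.

```python
import numpy as np


def dense_tied_weights(N, M, a, b, c, d):
    """Explicit (N*M) x (N*M) tied matrix: a = same row and column, b = same row only,
    c = same column only, d = neither."""
    W = np.empty((N * M, N * M))
    for n1 in range(N):
        for n2 in range(M):
            for m1 in range(N):
                for m2 in range(M):
                    same_row, same_col = n1 == m1, n2 == m2
                    W[n1 * M + n2, m1 * M + m2] = (
                        a if same_row and same_col else
                        b if same_row else
                        c if same_col else d)
    return W


def pooled_layer(X, a, b, c, d):
    """Same linear map, computed from sums instead of the explicit matrix."""
    row = X.sum(axis=1, keepdims=True)    # sum over columns, one value per row
    col = X.sum(axis=0, keepdims=True)    # sum over rows, one value per column
    tot = X.sum()
    return a * X + b * (row - X) + c * (col - X) + d * (tot - row - col + X)


rng = np.random.default_rng(0)
N, M = 3, 4                               # assumed sizes
X = rng.normal(size=(N, M))
theta = (0.5, -1.0, 2.0, 0.25)            # arbitrary parameter values

Y_dense = (dense_tied_weights(N, M, *theta) @ X.reshape(-1)).reshape(N, M)
Y_pooled = pooled_layer(X, *theta)
```

If unobserved entries are stored as zeros, the sums only involve the non-zero entries; dense arrays are used here purely to keep the example short.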
4.3 Multiple Layers and Channels
Equivariance is maintained by composition of equivariant functions. This allows us to stack ERL to build “deep” models that operate on relational data(bases). Using multiple input () and output () channels is also possible by replacing the parameter matrix , with the parameter tensor ; while copies have the same parameter-tying pattern —i.e., there is no parameter-sharing “across” channels. The single-channel matrix-vector product in where is now replaced with contraction of two tensors , for .
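The multi-channel contraction and the stacking of layers can be sketched for the simplest case of a single relation over a single entity (a set), where each channel pair has a tied matrix of the form a*I + b*11^T. The sizes, the helper names, and the choice of a ReLU between layers are assumptions of this example. Each input/output channel pair gets its own pair of parameters, so nothing is shared across channels.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K_in, K_hidden, K_out = 5, 2, 4, 3    # assumed sizes


def set_layer_weights(k_out, k_in, n, rng):
    """Parameter tensor W[k_out, k_in, n, n] with W[o, i] = a[o, i] * I + b[o, i] * 11^T."""
    a = rng.normal(size=(k_out, k_in, 1, 1))
    b = rng.normal(size=(k_out, k_in, 1, 1))
    return a * np.eye(n) + b * np.ones((n, n))


def layer(W, X):
    """Tensor contraction: Y[o, n] = sum over i, m of W[o, i, n, m] * X[i, m]."""
    return np.einsum("oinm,im->on", W, X)


def model(X, W1, W2):
    """Two stacked equivariant layers with a pointwise nonlinearity in between."""
    return layer(W2, np.maximum(layer(W1, X), 0.0))


W1 = set_layer_weights(K_hidden, K_in, N, rng)
W2 = set_layer_weights(K_out, K_hidden, N, rng)
X = rng.normal(size=(K_in, N))
perm = rng.permutation(N)

out_of_permuted = model(X[:, perm], W1, W2)
permuted_out = model(X, W1, W2)[:, perm]
```

Permuting the instances before or after the stacked model gives the same result, which is the composition property stated above.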
5 Experiments
Our experiments study the viability of ERLs for deep embedding, prediction of missing records, and inductive reasoning. See Appendix D for details as well as an additional real-world experiment where we predict the outcome of a soccer match from historical records and player information.
To continue with our running example we synthesize a toy dataset, restricted to (student-course), (student-professor), and (professor-course). Each matrix in the relational database, , is produced by first uniformly sampling an h-dimensional embedding for each entity instance , followed by matrix product . A sparse subset of these matrices are observed by our model in the following experiments. Note that while we use a simple matrix product to generate the content of tables from latent factors, the model is oblivious to this generative process.
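A generator of this kind might look like the following sketch. All concrete numbers (instance counts, embedding size, observed fraction) and the sampling range [0, 1) are assumptions made for illustration, not the values used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
n_students, n_courses, n_profs, h = 30, 20, 10, 2   # assumed sizes
observed_fraction = 0.1                              # assumed sparsity level

# One h-dimensional latent embedding per entity instance, sampled uniformly.
Z_s = rng.uniform(size=(n_students, h))
Z_c = rng.uniform(size=(n_courses, h))
Z_p = rng.uniform(size=(n_profs, h))

# Each table is the matrix product of the embeddings of its two entities.
tables = {
    "student-course": Z_s @ Z_c.T,
    "student-professor": Z_s @ Z_p.T,
    "professor-course": Z_p @ Z_c.T,
}

# Boolean masks marking the sparse subset of entries the model may observe.
masks = {name: rng.random(T.shape) < observed_fraction for name, T in tables.items()}
```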
We use a factorized auto-encoding architecture consisting of a stack of ERLs followed by pooling that produces code matrices for each entity, student, course and professor. The code is then fed to a decoding stack of ERLs to reconstruct the sparse .
We use a small embedding dimension for visualization. We use , and the model observes of database entries. Fig. 3(a) visualizes the ground-truth versus the estimated embedding for students; see Appendix D for more figures. The figure suggests the produced embedding agrees with the ground truth (note that in the best case, the two embeddings agree up to a diffeomorphism).
Missing Record Prediction.
Here, we set out to predict missing values in student-course table using the observations across the whole database. For this, the factorized auto-encoding architecture is trained to only minimize the reconstruction error for “observed” entries in student-course tensor.
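Restricting the objective to observed entries amounts to a masked loss. The sketch below uses mean squared error as the reconstruction error, which is an assumption (the text does not name the loss), and the function name `masked_mse` is hypothetical.

```python
import numpy as np


def masked_mse(prediction, target, observed):
    """Mean squared error computed over observed entries only."""
    diff = (prediction - target)[observed]
    return float(np.mean(diff ** 2)) if diff.size else 0.0


target = np.array([[1.0, 2.0], [3.0, 4.0]])
prediction = np.array([[1.0, 0.0], [0.0, 4.0]])
observed = np.array([[True, False], [True, True]])

loss = masked_mse(prediction, target, observed)
```

The unobserved entry at position (0, 1) does not contribute, so the error comes only from the three observed entries.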
We use , with an embedding size of 10 and vary two quantities: 1) the percentage of observed entries (i.e., sparsity of all database tensors) at training time; and 2) once the model is trained, the sparsity of test time observations. Figure 4(a) visualizes the average prediction error over 5 runs as a function of these two quantities for the student-course table. The figure confirms our expectation that more observations during both training and test can improve the prediction accuracy. More importantly, once the model is trained at a particular sparsity level, it shows robustness to changes in the sparsity level at test time.
Predictive Value of Side-Information.
Our previous experiments raise the question of whether we gain anything by using the “entire” database for predicting missing entries of a particular table. That is, we could simply make predictions using only the target tensor (student-course table) for both training and testing. To answer this question, we fix the sparsity level of the student-course table at 0.1, and train models with increasing levels of sparsity for the side information. That is, we vary the sparsity of the tensors and in the range . Table 2 shows that our holistic approach to predictive analysis is indeed advantageous: side information in the form of student-professor and course-professor tables can improve the prediction of missing records in the student-course table.
Once we finish training the model on a relational dataset, we can apply it to another instantiation, that is, a dataset with completely different students, courses and professors. Fig. 3(b) reproduces the embedding experiment, and Fig. 4(b) shows the missing record prediction results for the inductive setting. In both cases the model performs reasonably well when applied to a new database. This setting can have interesting real-world applications, as it enables transfer learning across databases and allows for predictive analysis without training for new entities in a database as they become available.
6 Related Literature
Here, we briefly review the related literature in different areas. To our knowledge there are no similar frameworks for direct application of deep models to relational databases, and current practice is to automate feature-engineering for specific prediction tasks (Lam et al., 2018).
Statistical Relational Learning.
Statistical relational learning extends the reach of probabilistic inference to the relational model (Raedt et al., 2016). For example, a variety of lifted inference procedures extend inference methods in graphical models to the relational setting, where in some cases the symmetry group of the model is used to speed up inference (Kersting, 2012). Another relevant direction explored in this community includes extensions of logistic regression to relational data (Kazemi et al., 2014; Kazemi & Poole, 2017).
An alternative to inference with symbolic representations of relational data is to use embeddings. In particular, tensor factorization methods that offer tractable inference in latent variable graphical models (Anandkumar et al., 2014) are extensively used for knowledge-graph embedding (Nickel et al., 2016). A knowledge-graph can be expressed as an ER diagram with a single relation , where representing head and tail entities and is an entity representing the relation. Alternatively, one could think of a knowledge-graph as a graph representation for an instantiated ER diagram (as opposed to a set of tables or tensors). However, in knowledge-graphs, an entity-type is a second class citizen, as it is either another attribute, or it is expressed through relations to special objects representing different “types”.
Relational and Geometric Deep Learning.
Here, we provide a brief overview; see Hamilton et al. (2017b); Battaglia et al. (2018) for a detailed review. Scarselli et al. (2009) introduced a generic framework that iteratively updates node embeddings using neural networks; see also (Li et al., 2015). Gilmer et al. (2017) proposed a similar iterative procedure that updates node embeddings and messages between the neighbouring nodes, and show that it subsumes several other deep models for attributed graphs (Duvenaud et al., 2015; Schütt et al., 2017; Li et al., 2015; Battaglia et al., 2016; Kearnes et al., 2016), including spectral methods that we discuss next. Their method is further generalized in (Kondor et al., 2018b) as well as (Maron et al., 2018), which is in turn subsumed in our framework. Spectral methods extend convolution to graphs (and manifolds) using eigenvectors of the Laplacian as the generalization of the Fourier basis (Bronstein et al., 2017; Bruna et al., 2014). Simplified variations of this approach lead to an intuitive yet non-maximal parameter-sharing scheme that is widely used in practice (Defferrard et al., 2016; Kipf & Welling, 2016). This type of simplified graph convolution has also been used for relational reasoning with knowledge-graphs (Schlichtkrull et al., 2018).
Equivariant Deep Models.
An alternative generalization of convolution is defined for functions over groups (Olah, 2014; Cohen & Welling, 2016a), or more generally homogeneous spaces (Cohen et al., 2018b). Moreover, convolution can be performed in the Fourier domain in this setting, where irreducible representations of a group become the Fourier bases (Kondor & Trivedi, 2018). Equivariant deep model design for a variety of structured domains is explored in several other recent works (e.g., Worrall et al., 2017; Cohen et al., 2018a; Kondor et al., 2018a; Sabour et al., 2017; Weiler et al., 2018); see also (Cohen & Welling, 2016b; Weiler et al., 2017; Kondor et al., 2018b; Anselmi et al., 2019).
Parameter-Sharing, Exchangeability and Equivariance.
The notion of invariance is also studied under the term exchangeability in statistics (Orbanz & Roy, 2015); see also (Benjamin & Whye Teh, 2019) for a probabilistic approach to equivariance. In graphical models exchangeability is often encoded through plate notation, where parameter-sharing happens implicitly.
In the AI community, this relationship between the parameter sharing and “invariance” properties of the network was noticed in the early days of the Perceptron (Minsky & Papert, 2017; Shawe-Taylor, 1989, 1993). This was rediscovered in (Ravanbakhsh et al., 2017), where this relation was leveraged for equivariant model design. They show that when the group action is discrete (i.e., is in the form of a permutation) “equivariance” to any group action can be obtained by parameter-sharing. In particular, one of the procedures discussed there ties the elements of based on the orbits of the “joint” action of the group on the rows and columns of . The parameter-tying scheme in the following equivariant networks can be obtained in this way: I. Zaheer et al. (2017) propose an equivariant model for set data. Our model reduces to their parameter-tying when we have a single relation with a single entity – i.e., ; i.e., a set of instances; see also Example (2) in Appendix B. II. Hartford et al. (2018) consider a more general setting of interaction across different sets, such as user-tag-movie relations. Our model produces their parameter-sharing when we have a single relation with multiple entities , where all entities appear only once – i.e., . III. Maron et al. (2018) further relax the assumption above, and allow for . Intuitively, this form of relational data can model the interactions within and between sets; for example interaction within nodes of a graph is captured by an adjacency matrix, corresponding to and . This type of parameter-tying is maximal for graphs, and subsumes the parameter-tying approaches derived by simplification of Laplacian-based methods. When restricted to a single relation, our model reduces to the model of (Maron et al., 2018); however, when we have multiple relations, for , our model captures the interaction between different relations / tensors.
7 Discussion and Future Work
We have outlined a novel and principled approach to deep learning with relational data(bases). In particular, we introduced a simple constraint in the form of tied parameters for the standard neural layer and proved that any other tying scheme is either not equivariant w.r.t. exchangeabilities of relational data or can be obtained by further constraining the parameters of our model. The proposed model can be applied in the inductive setting, where the relational data(base) used during training and test have no overlap.
While our model enjoys a linear computational complexity, we have to overcome one more hurdle before applying this model to large-scale real-world databases, where one must work with sub-samples of the data. However, any such sampling has the effect of sparsifying the observed relations. A careful sampling procedure is required that minimizes this sparsification for a particular subset of entities or relations. While several recent works propose solutions to similar problems on graphs and tensors (e.g., Hamilton et al., 2017a; Hartford et al., 2018; Ying et al., 2018; Eksombatchai et al., 2017; Chen et al., 2018; Huang et al., 2018), we leave this important direction for relational databases to future work.
- Anandkumar et al. (2014) Anandkumar, A., Ge, R., Hsu, D., Kakade, S. M., and Telgarsky, M. Tensor decompositions for learning latent variable models. The Journal of Machine Learning Research, 15(1):2773–2832, 2014.
- Anselmi et al. (2019) Anselmi, F., Evangelopoulos, G., Rosasco, L., and Poggio, T. Symmetry-adapted representation learning. Pattern Recognition, 86:201–208, 2019.
- Battaglia et al. (2016) Battaglia, P., Pascanu, R., Lai, M., Rezende, D. J., et al. Interaction networks for learning about objects, relations and physics. In Advances in neural information processing systems, pp. 4502–4510, 2016.
- Battaglia et al. (2018) Battaglia, P. W., Hamrick, J. B., Bapst, V., Sanchez-Gonzalez, A., Zambaldi, V., Malinowski, M., Tacchetti, A., Raposo, D., Santoro, A., Faulkner, R., et al. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261, 2018.
- Benjamin & Whye Teh (2019) Benjamin, B.-R. and Whye Teh, Y. Probabilistic symmetry and invariant neural networks. arXiv preprint, arXiv:1901.06082, 2019.
- Bronstein et al. (2017) Bronstein, M. M., Bruna, J., LeCun, Y., Szlam, A., and Vandergheynst, P. Geometric deep learning: going beyond euclidean data. IEEE Signal Processing Magazine, 34(4):18–42, 2017.
- Bruna et al. (2014) Bruna, J., Zaremba, W., Szlam, A., and LeCun, Y. Spectral networks and locally connected networks on graphs. ICLR, 2014.
- Chen et al. (2018) Chen, J., Ma, T., and Xiao, C. Fastgcn: Fast learning with graph convolutional networks via importance sampling. CoRR, abs/1801.10247, 2018.
- Chen (1976) Chen, P. P.-S. The entity-relationship model—toward a unified view of data. ACM Transactions on Database Systems (TODS), 1(1):9–36, 1976.
- Cohen & Welling (2016a) Cohen, T. S. and Welling, M. Group equivariant convolutional networks. arXiv preprint arXiv:1602.07576, 2016a.
- Cohen & Welling (2016b) Cohen, T. S. and Welling, M. Steerable cnns. arXiv preprint arXiv:1612.08498, 2016b.
- Cohen et al. (2018a) Cohen, T. S., Geiger, M., Köhler, J., and Welling, M. Spherical cnns. arXiv preprint arXiv:1801.10130, 2018a.
- Cohen et al. (2018b) Cohen, T. S., Geiger, M., and Weiler, M. Intertwiners between induced representations (with applications to the theory of equivariant neural networks). arXiv preprint arXiv:1803.10743, 2018b.
- Defferrard et al. (2016) Defferrard, M., Bresson, X., and Vandergheynst, P. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, pp. 3844–3852, 2016.
- Duvenaud et al. (2015) Duvenaud, D. K., Maclaurin, D., Iparraguirre, J., Bombarell, R., Hirzel, T., Aspuru-Guzik, A., and Adams, R. P. Convolutional networks on graphs for learning molecular fingerprints. In Advances in neural information processing systems, 2015.
- Eksombatchai et al. (2017) Eksombatchai, C., Jindal, P., Liu, J. Z., Liu, Y., Sharma, R., Sugnet, C., Ulrich, M., and Leskovec, J. Pixie: A system for recommending 3+ billion items to 200+ million users in real-time. CoRR, abs/1711.07601, 2017.
- Getoor & Taskar (2007) Getoor, L. and Taskar, B. Introduction to statistical relational learning. MIT press, 2007.
- Gilmer et al. (2017) Gilmer, J., Schoenholz, S. S., Riley, P. F., Vinyals, O., and Dahl, G. E. Neural message passing for quantum chemistry. arXiv preprint arXiv:1704.01212, 2017.
- Hamilton et al. (2017a) Hamilton, W., Ying, Z., and Leskovec, J. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pp. 1024–1034, 2017a.
- Hamilton et al. (2017b) Hamilton, W. L., Ying, R., and Leskovec, J. Representation learning on graphs: Methods and applications. arXiv preprint arXiv:1709.05584, 2017b.
- Hartford et al. (2018) Hartford, J., Graham, D. R., Leyton-Brown, K., and Ravanbakhsh, S. Deep models of interactions across sets. In Proceedings of the 35th International Conference on Machine Learning, pp. 1909–1918, 2018.
- Huang et al. (2018) Huang, W., Zhang, T., Rong, Y., and Huang, J. Adaptive sampling towards fast graph representation learning. In Advances in Neural Information Processing Systems 31, pp. 4559–4568, 2018.
- Ioffe & Szegedy (2015) Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
- Kazemi & Poole (2017) Kazemi, S. M. and Poole, D. Relnn: a deep neural model for relational learning. arXiv preprint arXiv:1712.02831, 2017.
- Kazemi et al. (2014) Kazemi, S. M., Buchman, D., Kersting, K., Natarajan, S., and Poole, D. Relational logistic regression. In KR. Vienna, 2014.
- Kearnes et al. (2016) Kearnes, S., McCloskey, K., Berndl, M., Pande, V., and Riley, P. Molecular graph convolutions: moving beyond fingerprints. Journal of computer-aided molecular design, 30(8):595–608, 2016.
- Kersting (2012) Kersting, K. Lifted probabilistic inference. In ECAI, pp. 33–38, 2012.
- Kingma & Ba (2014) Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Kipf & Welling (2016) Kipf, T. N. and Welling, M. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
- Kondor & Trivedi (2018) Kondor, R. and Trivedi, S. On the generalization of equivariance and convolution in neural networks to the action of compact groups. arXiv preprint arXiv:1802.03690, 2018.
- Kondor et al. (2018a) Kondor, R., Lin, Z., and Trivedi, S. Clebsch–gordan nets: a fully fourier space spherical convolutional neural network. In Advances in Neural Information Processing Systems, pp. 10137–10146, 2018a.
- Kondor et al. (2018b) Kondor, R., Son, H. T., Pan, H., Anderson, B., and Trivedi, S. Covariant compositional networks for learning graphs. arXiv preprint arXiv:1801.02144, 2018b.
- Lam et al. (2018) Lam, H. T., Minh, T. N., Sinn, M., Buesser, B., and Wistuba, M. Learning features for relational data. arXiv preprint arXiv:1801.05372, 2018.
- Li et al. (2015) Li, Y., Tarlow, D., Brockschmidt, M., and Zemel, R. Gated graph sequence neural networks. arXiv preprint arXiv:1511.05493, 2015.
- Maron et al. (2018) Maron, H., Ben-Hamu, H., Shamir, N., and Lipman, Y. Invariant and equivariant graph networks. arXiv preprint arXiv:1812.09902, 2018.
- Minsky & Papert (2017) Minsky, M. and Papert, S. A. Perceptrons: An introduction to computational geometry. MIT press, 2017.
- Nickel et al. (2016) Nickel, M., Murphy, K., Tresp, V., and Gabrilovich, E. A review of relational machine learning for knowledge graphs. Proceedings of the IEEE, 104(1):11–33, 2016.
- Olah (2014) Olah, C. Groups and group convolutions, 2014.
- Orbanz & Roy (2015) Orbanz, P. and Roy, D. M. Bayesian models of graphs, arrays and other exchangeable random structures. IEEE transactions on pattern analysis and machine intelligence, 37(2):437–461, 2015.
- Raedt et al. (2016) Raedt, L. D., Kersting, K., Natarajan, S., and Poole, D. Statistical relational artificial intelligence: Logic, probability, and computation. Synthesis Lectures on Artificial Intelligence and Machine Learning, 10(2):1–189, 2016.
- Ravanbakhsh et al. (2017) Ravanbakhsh, S., Schneider, J., and Poczos, B. Equivariance through parameter-sharing. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of JMLR: WCP, August 2017.
- Robinson et al. (2013) Robinson, I., Webber, J., and Eifrem, E. Graph databases. O’Reilly Media, Inc., 2013.
- Sabour et al. (2017) Sabour, S., Frosst, N., and Hinton, G. E. Dynamic routing between capsules. In Advances in Neural Information Processing Systems, pp. 3856–3866, 2017.
- Scarselli et al. (2009) Scarselli, F., Gori, M., Tsoi, A. C., Hagenbuchner, M., and Monfardini, G. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80, 2009.
- Schlichtkrull et al. (2018) Schlichtkrull, M., Kipf, T. N., Bloem, P., van den Berg, R., Titov, I., and Welling, M. Modeling relational data with graph convolutional networks. In European Semantic Web Conference, pp. 593–607. Springer, 2018.
- Schütt et al. (2017) Schütt, K. T., Arbabzadah, F., Chmiela, S., Müller, K. R., and Tkatchenko, A. Quantum-chemical insights from deep tensor neural networks. Nature communications, 8:13890, 2017.
- Shawe-Taylor (1989) Shawe-Taylor, J. Building symmetries into feedforward networks. In Artificial Neural Networks, 1989., First IEE International Conference on (Conf. Publ. No. 313), pp. 158–162. IET, 1989.
- Shawe-Taylor (1993) Shawe-Taylor, J. Symmetries and discriminability in feedforward network architectures. IEEE Transactions on Neural Networks, 4(5):816–826, 1993.
- Weiler et al. (2017) Weiler, M., Hamprecht, F. A., and Storath, M. Learning steerable filters for rotation equivariant cnns. arXiv preprint arXiv:1711.07289, 2017.
- Weiler et al. (2018) Weiler, M., Boomsma, W., Geiger, M., Welling, M., and Cohen, T. 3d steerable cnns: Learning rotationally equivariant features in volumetric data. In Advances in Neural Information Processing Systems, pp. 10401–10412, 2018.
- Worrall et al. (2017) Worrall, D. E., Garbin, S. J., Turmukhambetov, D., and Brostow, G. J. Harmonic networks: Deep translation and rotation equivariance. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), volume 2, 2017.
- Ying et al. (2018) Ying, R., He, R., Chen, K., Eksombatchai, P., Hamilton, W. L., and Leskovec, J. Graph convolutional neural networks for web-scale recommender systems. arXiv preprint arXiv:1806.01973, 2018.
- Zaheer et al. (2017) Zaheer, M., Kottur, S., Ravanbakhsh, S., Poczos, B., Salakhutdinov, R. R., and Smola, A. J. Deep sets. In Advances in Neural Information Processing Systems, 2017.
Appendix A Bias Parameters
For full generality, our definition of ERL could also include bias terms without affecting its exchangeability properties. We exclude these in the statements of our main theorems for the sake of simplicity, but discuss their inclusion here for completeness. For each relation , we define a bias tensor . The elements of are tied together in a manner similar to the tying of elements in each : Two elements and are tied together iff , using the definition of equivalence from Eq. 9. Thus, we have a vector of additional free parameters for each relation , where
Consistent with our previous notation, we define , and . Then an ERL with bias terms is given by
The following Claim asserts that we can add this bias term without affecting the desired properties of the ERL.
If is an ERL, then is an ERL.
The proof (found in Section C.1) argues that, since is an ERL, we just need to show that iff , which holds due to the tying of patterns in each .
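The tying of bias elements can be illustrated with a minimal sketch. It assumes that the equivalence of Eq. 9 amounts to two index tuples sharing the same equality pattern among positions that index the same entity; the function names, the entity names, and the representation of a relation as a tuple of entity names are all hypothetical and chosen only for illustration.

```python
import itertools
import numpy as np

def equality_pattern(relation, index):
    """Canonical label of an index tuple (assumed reading of Eq. 9).

    For each position, record the first position that holds the same
    entity and the same index value.
    """
    pattern = []
    for i, (entity, value) in enumerate(zip(relation, index)):
        first = next(j for j in range(i + 1)
                     if relation[j] == entity and index[j] == value)
        pattern.append(first)
    return tuple(pattern)

def tied_bias(relation, sizes):
    """Return (class_ids, n_params) for the bias tensor of one relation.

    class_ids has the shape of the relation's data tensor; elements with
    the same id share one free bias parameter.
    """
    shape = tuple(sizes[entity] for entity in relation)
    classes = {}
    class_ids = np.zeros(shape, dtype=int)
    for index in itertools.product(*(range(s) for s in shape)):
        pattern = equality_pattern(relation, index)
        class_ids[index] = classes.setdefault(pattern, len(classes))
    return class_ids, len(classes)

# A relation over distinct entities needs a single bias parameter.
_, n_plain = tied_bias(("student", "course"), {"student": 3, "course": 2})

# A self-relation separates diagonal from off-diagonal elements (2 parameters).
ids_self, n_self = tied_bias(("student", "student"), {"student": 3})
```

Under this reading, a relation without self-relations contributes one scalar bias, while repeated entities add one parameter per distinct equality pattern.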
Appendix B Simplifications for Models without Self-Relations
In the special case where the multi relations and are sets (i.e., have no self-relations), the parameter tying scheme of Section 4.1 can be simplified considerably. In this section we describe some useful properties of this special setting.
B.0.1 Efficient Implementation Using Subset-Pooling
Due to the particular structure of when all relations contain only unique entities, the operation in the ERL can be implemented using (sum/mean) pooling operations over the tensors for , without any need for vectorization, or for storing directly.
For and , let be the summation of the tensor over the dimensions specified by . That is, where . Then we can write element in the -th block of as
where is the restriction of to only elements indexing entities in . This formulation lends itself to a practical, efficient implementation where we simply compute each term and broadcast-add them back into a tensor of appropriate dimensions.
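The pool-and-broadcast idea can be sketched as follows. This is a simplified illustration rather than the full layer: it assumes a single relation mapped to itself, a single channel, dense data, and one scalar weight per subset of tensor axes. The names `axis_subsets` and `pooled_layer` are hypothetical.

```python
from itertools import combinations
import numpy as np

def axis_subsets(ndim):
    """All subsets of the axes 0..ndim-1, including the empty subset."""
    return [s for k in range(ndim + 1) for s in combinations(range(ndim), k)]

def pooled_layer(x, weights, pool=np.sum):
    """Sum of weighted pooled terms, each broadcast back to x's shape.

    weights maps a tuple of axes to a scalar; the empty tuple is the
    unpooled (identity) term.
    """
    out = np.zeros(x.shape, dtype=float)
    for axes, w in weights.items():
        # keepdims=True keeps pooled axes as size 1 so that the
        # addition below broadcasts over them.
        term = pool(x, axis=axes, keepdims=True) if axes else x
        out = out + w * term
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))
weights = {axes: rng.normal() for axes in axis_subsets(x.ndim)}
y = pooled_layer(x, weights)

# Permuting the instances of each entity permutes the output the same way.
rows, cols = rng.permutation(3), rng.permutation(4)
y_perm = pooled_layer(x[rows][:, cols], weights)
```

No parameter matrix is materialised in this sketch: the cost is one pooling pass per subset of axes, and passing `pool=np.mean` switches to mean pooling.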
B.0.2 One-to-One and One-to-Many Relations
In the special case of one-to-one or one-to-many relations (e.g., in Fig. 1, one professor may teach many courses, but each course has only one professor), we may further reduce the number of parameters due to redundancies. Suppose is some relation, and entity is in a one-to- relation with the remaining entities of . Consider the 1D sub-array of obtained by varying the value of while holding the remaining values fixed. This sub-array contains just a single non-zero entry. According to the tying scheme described in Section 4.1, the parameter block will contain unique parameter values and . Intuitively, however, these two parameters capture exactly the same information, since the sub-array obtained by fixing the values of contains exactly the same data as the sub-array obtained by fixing the values of (i.e., the same single value). More concretely, to use the notation of Section B.0.1, we have in Eq. 13, and so we may tie and .
In fact, we can similarly reduce the number of free parameters in the case of self-relations (i.e., relations with non-unique entities).
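The one-to-many redundancy discussed in this section can be checked numerically on a toy example. The sketch below assumes a hypothetical course-by-professor table in which each course has exactly one professor, and compares the pooled and unpooled terms only at the observed entries; the array names and values are illustrative.

```python
import numpy as np

# Hypothetical data: course c is taught by professor teaches[c].
teaches = np.array([0, 2, 2, 1])
values = np.array([0.5, -1.0, 2.0, 3.0])

n_courses, n_profs = 4, 3
x = np.zeros((n_courses, n_profs))
observed = np.zeros((n_courses, n_profs), dtype=bool)
x[np.arange(n_courses), teaches] = values
observed[np.arange(n_courses), teaches] = True

# Pooling over the professor axis sums a row with one non-zero entry,
# so it returns that entry itself.
pooled = x.sum(axis=1, keepdims=True)
pooled_at_observed = np.broadcast_to(pooled, x.shape)[observed]
identity_at_observed = x[observed]
```

Because the two terms coincide wherever data is observed, separate weights for them would be redundant in this toy setting, which is the intuition behind tying the two parameters.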
Appendix C Proofs
Observe that for any index tuple , we can express (Section 4.1) as
C.1 Proof of Claim 2
We want to show that
iff . Since is an ERL, this is equivalent to showing
() Suppose , with defined as in creftype 1. Fix some relation and consider the -th block of :