Ensemble Multi-Relational Graph Neural Networks

by Yuling Wang, et al.

It is well established that graph neural networks (GNNs) can be interpreted and designed from the perspective of an optimization objective. With a clear optimization objective, the deduced GNN architecture has a sound theoretical foundation and can flexibly remedy the weaknesses of GNNs. However, this optimization objective has only been established for GNNs on single-relational graphs. Can we infer a new type of GNN for multi-relational graphs by extending this optimization objective, so as to simultaneously solve the issues of previous multi-relational GNNs, e.g., over-parameterization? In this paper, we propose novel ensemble multi-relational GNNs by designing an ensemble multi-relational (EMR) optimization objective. From this EMR objective we derive an iterative updating rule, which can be formalized as an ensemble message passing (EnMP) layer with multiple relations. We further analyze the properties of the EnMP layer, e.g., its relationship with multi-relational personalized PageRank. Finally, we propose a new multi-relational GNN that alleviates the over-smoothing and over-parameterization issues. Extensive experiments on four benchmark datasets demonstrate the effectiveness of the proposed model.


1 Introduction

Graph neural networks (GNNs), which have been applied to a wide range of downstream tasks, have in recent years displayed superior performance on graph data, e.g., biological networks [Huang et al.2020a] and knowledge graphs [Yu et al.2021]. Generally, current GNN architectures follow the message passing framework, in which the propagation process is the key component. For example, GCN [Kipf and Welling2016] directly aggregates and propagates transformed features along the topology at each layer. PPNP [Klicpera et al.2018] aggregates both the transformed features and the original features at each layer. JKNet [Xu et al.2018b] selectively combines the aggregated messages from different layers via concatenation/max-pooling/attention operations.

Recent studies [Zhu et al.2021, Ma et al.2021] have shown that, despite their different propagation processes, various GNNs can usually be unified under an optimization objective containing a feature fitting term and a graph regularization term:

$$\mathcal{O} = \min_{Z} \left\{ \zeta \left\| F_1 Z - F_2 H \right\|_F^2 + \xi \, \mathrm{tr}\left( Z^\top \tilde{L} Z \right) \right\}, \tag{1}$$

where $H$ is the original input feature and $\tilde{L}$ is the graph Laplacian matrix encoding the graph structure. $Z$ is the propagated representation, $\zeta$ and $\xi$ are non-negative trade-off coefficients, and $F_1$, $F_2$ are defined as arbitrary graph convolutional kernels and usually set as $I$. This optimization objective reveals a mathematical guideline that essentially governs the propagation mechanism, and opens a new path to designing novel GNNs. That is, from such a clear optimization objective one can derive the corresponding propagation process, which makes the designed GNN architecture more interpretable and reliable [Zhu et al.2021, Liu et al.2021, Yang et al.2021]. For example, [Zhu et al.2021] replaces $F_1$ and $F_2$ with a high-pass kernel and infers new high-pass GNNs; [Liu et al.2021] applies the $\ell_1$ norm to the graph regularization term and infers Elastic GNNs.

Despite the great potential of this optimization objective for designing GNNs, it has only been proposed for traditional homogeneous graphs, rather than for multi-relational graphs with multiple types of relations. However, in real-world applications, multi-relational graphs are more general and pervasive in many areas, for instance, the various types of chemical bonds in molecular graphs and the diverse relationships between people in social networks. Therefore, it is highly desirable to design GNN models that can adapt to multi-relational graphs. Some work has been devoted to multi-relational GNNs, which can be roughly categorized into feature mapping based approaches [Schlichtkrull et al.2018] and relation embedding learning based approaches [Vashishth et al.2019]. However, these methods usually design the propagation process heuristically, without a clear and explicit mathematical objective. Although they improve performance, they still suffer from the problems of over-parameterization [Vashishth et al.2019] and over-smoothing [Oono and Suzuki2019].

“Can we remedy the original optimization objective to design a new type of multi-relational GNNs that is more reliable with solid objective, and at the same time, alleviates the weakness of current multi-relational GNNs, e.g., over-smoothing and over-parameterization?”

Nevertheless, it is technically challenging to achieve this goal. Firstly, how should multiple relations be incorporated into an optimization objective? Different relations play different roles, and we need to distinguish them in this optimization objective as well. Secondly, to satisfy the above requirements, the optimization objective inevitably becomes more complex, possibly with more constraints. How to derive the underlying message passing mechanism by optimizing the objective is another challenge. Thirdly, even with the message passing mechanism, it remains to be determined how to integrate it into deep neural networks via simple operations without introducing excessive parameters.

In this paper, we propose novel multi-relational GNNs by designing an ensemble optimization objective. In particular, our proposed ensemble optimization objective consists of a feature fitting term and an ensemble multi-relational graph regularization (EMR) term. We then derive an iterative optimization algorithm from this objective to learn both the node representation and the relational coefficients. We further show that this iterative optimization algorithm can be formalized as an ensemble message passing layer, which is closely related to multi-relational personalized PageRank and covers some existing propagation processes. Finally, we integrate the derived ensemble message passing layer into deep neural networks by decoupling the feature transformation and message passing process, and develop a novel family of multi-relational GNN architectures. Our key contributions can be summarized as follows:

  • We make the first effort to derive multi-relational GNNs from the perspective of an optimization framework, so as to make the derived multi-relational GNNs more reliable. This research holds great potential for opening a new path to designing multi-relational GNNs.

  • We propose a new optimization objective for multi-relational graphs, and we derive a novel ensemble message passing (EnMP) layer. A new family of multi-relational GNNs is then proposed in a decoupled way.

  • We establish the relationships between our proposed EnMP layer and multi-relational personalized PageRank, as well as some current message passing layers. Moreover, our proposed multi-relational GNNs can alleviate the over-smoothing and over-parameterization issues.

  • Extensive experiments are conducted, which comprehensively demonstrate the effectiveness of our proposed multi-relational GNNs.

2 Related Work

Graph Neural Networks. The dominant paradigms of GNNs can be generally summarized into two branches: spectral-based GNNs [Defferrard et al.2016, Klicpera et al.2018] and spatial-based GNNs [Gilmer et al.2017, Klicpera et al.2018]. Various representative GNNs have been proposed by designing different information aggregation and update strategies along topologies, e.g., [Gilmer et al.2017, Klicpera et al.2018]. Recent works [Zhu et al.2021, Ma et al.2021] have explored the intrinsically unified optimization framework behind existing GNNs.

Multi-relational Graph Neural Networks. The core idea of multi-relational GNNs [Schlichtkrull et al.2018, Vashishth et al.2019, Thanapalasingam et al.2021] is to encode relational graph structure information into low-dimensional node or relation embeddings. As a representative relational GNN, RGCN [Schlichtkrull et al.2018] designs a specific convolution for each relation and then aggregates the convolution results under all relations; the resulting excess parameters are learned entirely in an end-to-end manner. Another line of literature [Ji et al.2021a, Wang et al.2019, Fu et al.2020, Yun et al.2019] considers the heterogeneity of edges and nodes to construct meta-paths, and then aggregates messages from different meta-path based neighbors.

3 Proposed Method

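Notations. Consider a multi-relational graph $\mathcal{G} = (\mathcal{V}, \mathcal{E}, \mathcal{R})$ with nodes $v_i \in \mathcal{V}$ and labeled edges (relations) $(v_i, r, v_j) \in \mathcal{E}$, where $r \in \mathcal{R}$ is a relation type. Let $n$ be the number of nodes and $R$ the number of relation types. Graph structure under relation $r$ can be described by the adjacency matrix $A^{(r)} \in \mathbb{R}^{n \times n}$, where $A^{(r)}_{i,j} = 1$ if there is an edge between nodes $i$ and $j$ under relation $r$, otherwise $A^{(r)}_{i,j} = 0$. The diagonal degree matrix is denoted as $D^{(r)} = \mathrm{diag}(d^{(r)}_1, \dots, d^{(r)}_n)$, where $d^{(r)}_i = \sum_j A^{(r)}_{i,j}$. We use $\tilde{A}^{(r)} = A^{(r)} + I$ to represent the adjacency matrix with added self-loop and $\tilde{D}^{(r)} = D^{(r)} + I$. Then the normalized adjacency matrix is $\hat{A}^{(r)} = (\tilde{D}^{(r)})^{-1/2} \tilde{A}^{(r)} (\tilde{D}^{(r)})^{-1/2}$. Correspondingly, $\tilde{L}^{(r)} = I - \hat{A}^{(r)}$ is the normalized symmetric positive semi-definite graph Laplacian matrix of relation $r$.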
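The per-relation matrices described above can be built in a few lines. The following is a minimal NumPy sketch, assuming dense symmetric 0/1 adjacency matrices and the usual self-loop normalization; the function names and the toy graph are illustrative and not taken from the paper.

```python
import numpy as np

def normalized_adjacency(adj: np.ndarray) -> np.ndarray:
    """Return D~^{-1/2} (A + I) D~^{-1/2} for one relation's adjacency matrix."""
    a_tilde = adj + np.eye(adj.shape[0])       # add self-loops
    d_tilde = a_tilde.sum(axis=1)              # degrees including the self-loop
    d_inv_sqrt = 1.0 / np.sqrt(d_tilde)
    return a_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def normalized_laplacian(adj: np.ndarray) -> np.ndarray:
    """Return the normalized Laplacian I - D~^{-1/2} (A + I) D~^{-1/2}."""
    return np.eye(adj.shape[0]) - normalized_adjacency(adj)

# Hypothetical toy graph: 4 nodes, 2 relation types.
A = {
    "r1": np.array([[0, 1, 0, 0],
                    [1, 0, 1, 0],
                    [0, 1, 0, 0],
                    [0, 0, 0, 0]], dtype=float),
    "r2": np.array([[0, 0, 0, 1],
                    [0, 0, 0, 1],
                    [0, 0, 0, 1],
                    [1, 1, 1, 0]], dtype=float),
}
A_hat = {r: normalized_adjacency(a) for r, a in A.items()}
L_tilde = {r: normalized_laplacian(a) for r, a in A.items()}
```

Because of the self-loops, a node that is isolated under some relation still has degree one, so the normalization never divides by zero.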

3.1 Ensemble Optimization Framework

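Given a multi-relational graph, one basic requirement is that the learned representation $Z$ should capture the homophily property in the graph with relation $r$, i.e., the representations $Z_i$ and $Z_j$ should be similar if nodes $i$ and $j$ are connected by relation $r$. We can achieve the above goal by minimizing the following term with respect to $Z$:

$$\mathrm{tr}\left( Z^\top \tilde{L}^{(r)} Z \right) = \frac{1}{2} \sum_{i,j=1}^{n} \tilde{A}^{(r)}_{i,j} \left\| \frac{Z_i}{\sqrt{\tilde{d}^{(r)}_i}} - \frac{Z_j}{\sqrt{\tilde{d}^{(r)}_j}} \right\|_2^2, \tag{2}$$

where $\tilde{L}^{(r)}$ is the normalized graph Laplacian of relation $r$, $\tilde{A}^{(r)}_{i,j} = 1$ represents that node $i$ and node $j$ are connected under relation $r$ (including self-loops), and $\tilde{d}^{(r)}_i = \sum_j \tilde{A}^{(r)}_{i,j}$ is the corresponding degree.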

With all the relations, we need to capture the graph signal smoothness under each of them simultaneously. Moreover, since different relations may play different roles, we need to distinguish their importance as well, which can be modelled as an ensemble multi-relational graph regularization as follows:

$$\lambda_1 \sum_{r=1}^{R} \mu_r \, \mathrm{tr}\left( Z^\top \tilde{L}^{(r)} Z \right) + \lambda_2 \left\| \mu \right\|_2^2, \quad \text{s.t. } \sum_{r=1}^{R} \mu_r = 1, \ \mu_r \ge 0, \tag{3}$$

where $\lambda_1$ and $\lambda_2$ are non-negative trade-off parameters. $\mu_r$ is the weight corresponding to relation $r$, and the sum of weights is 1 for constraining the search space of possible graph Laplacians. The regularization term $\|\mu\|_2^2$ is to avoid the parameter overfitting to only one relation [Geng et al.2012].

In addition to the topological constraint imposed by the EMR regularization term, we should also relate the learned representation $Z$ to the node features $H$. Therefore, there is a feature fitting term $\|Z - H\|_F^2$, which makes $Z$ encode information from the original feature $H$, so as to alleviate the over-smoothing problem. Finally, our proposed optimization framework for multi-relational graphs, which includes constraints on features and topology, is as follows:

$$\min_{Z, \mu} \ \left\| Z - H \right\|_F^2 + \lambda_1 \sum_{r=1}^{R} \mu_r \, \mathrm{tr}\left( Z^\top \tilde{L}^{(r)} Z \right) + \lambda_2 \left\| \mu \right\|_2^2, \quad \text{s.t. } \sum_{r=1}^{R} \mu_r = 1, \ \mu_r \ge 0. \tag{4}$$

By minimizing the above objective function, the optimal representation not only captures the smoothness between nodes, but also preserves the personalized information. Moreover, the optimal relational coefficients can be derived, reflecting the importance of different relations.

3.2 Ensemble Message Passing Mechanism

It is nontrivial to optimize $Z$ and $\mu$ together because Eq.(4) is not jointly convex w.r.t. $(Z, \mu)$. Fortunately, an iterative optimization strategy can be adopted: i) first optimize Eq.(4) w.r.t. $\mu$ with $Z$ fixed, which yields the relational coefficients $\mu$; ii) then solve Eq.(4) w.r.t. $Z$, with $\mu$ taking the value obtained in the last iteration. We will show that performing these two steps corresponds to one ensemble message passing layer in our relational GNNs.

Update Relational Coefficients

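We update the relational coefficients $\mu$ by fixing $Z$; the objective function (4) w.r.t. $\mu$ then reduces to:

$$\min_{\mu} \ \sum_{r=1}^{R} \mu_r s_r + \frac{\lambda_2}{\lambda_1} \left\| \mu \right\|_2^2, \quad \text{s.t. } \sum_{r=1}^{R} \mu_r = 1, \ \mu_r \ge 0, \tag{5}$$

where $s_r = \mathrm{tr}( Z^\top \tilde{L}^{(r)} Z )$ and $\tilde{L}^{(r)}$ is the normalized graph Laplacian of relation $r$.

(1) When $\frac{\lambda_2}{\lambda_1} = 0$, the coefficient might concentrate on one certain relation, i.e., $\mu_j = 1$ if $s_j = \min_{r} s_r$, and $\mu_j = 0$ otherwise. When $\frac{\lambda_2}{\lambda_1} = +\infty$, each relation will be assigned an equal coefficient, i.e., $\mu_r = \frac{1}{R}$ [Geng et al.2012].

(2) Otherwise, Eq.(5) is a convex function of $\mu$ with the constraint in a standard simplex [Chen and Ye2011], i.e., $\Delta = \{ \mu \in \mathbb{R}^{R} : \sum_{r=1}^{R} \mu_r = 1, \ \mu_r \ge 0 \}$. Therefore, the entropic mirror descent algorithm (EMDA) [Beck and Teboulle2003] can be used to optimize $\mu$, where the update process is described by Algorithm 1. The objective $f(\mu)$ should be a convex Lipschitz continuous function with Lipschitz constant $\phi$ for a fixed given norm. Here, we derive this Lipschitz constant from $\|\nabla f(\mu)\|_1 \le \frac{2\lambda_2}{\lambda_1} + \|s\|_1 = \phi$, where $s = \{s_1, \dots, s_R\}$.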

Update Node Representation

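Then we update the node representation $Z$ with $\mu$ fixed, where the objective function Eq. (4) w.r.t. $Z$ reduces to:

$$\min_{Z} \ \left\| Z - H \right\|_F^2 + \lambda_1 \sum_{r=1}^{R} \mu_r \, \mathrm{tr}\left( Z^\top \tilde{L}^{(r)} Z \right). \tag{6}$$

We can set the derivative of Eq. (6) with respect to $Z$ to zero and get the optimality condition for $Z$:

$$Z - H + \lambda_1 \sum_{r=1}^{R} \mu_r \tilde{L}^{(r)} Z = 0. \tag{7}$$

Since the eigenvalues of $I + \lambda_1 \sum_{r=1}^{R} \mu_r \tilde{L}^{(r)}$ are positive, it has an inverse matrix, and we can obtain the closed-form solution as follows:

$$Z = \left( I + \lambda_1 \sum_{r=1}^{R} \mu_r \tilde{L}^{(r)} \right)^{-1} H \tag{8}$$

$$\phantom{Z} = \frac{1}{1 + \lambda_1} \left( I - \frac{\lambda_1}{1 + \lambda_1} \sum_{r=1}^{R} \mu_r \hat{A}^{(r)} \right)^{-1} H, \tag{9}$$

where the second form uses $\tilde{L}^{(r)} = I - \hat{A}^{(r)}$, with $\hat{A}^{(r)}$ the normalized adjacency matrix of relation $r$, and $\sum_{r} \mu_r = 1$.

However, obtaining the inverse of the matrix will cause a computational complexity and memory requirement of $\mathcal{O}(n^2)$, which is inoperable in large-scale graphs. Therefore, we can approximate Eq.(9) using the following iterative update rule:

$$Z^{(k+1)} = \frac{1}{1 + \lambda_1} H + \frac{\lambda_1}{1 + \lambda_1} \sum_{r=1}^{R} \mu_r^{(k)} \hat{A}^{(r)} Z^{(k)}, \tag{10}$$

where $k$ is the iteration number.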
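To make the relation between the closed-form solution and its iterative approximation concrete, here is a minimal NumPy sketch. It assumes the propagation rule has the personalized-PageRank-like form $Z \leftarrow \frac{1}{1+\lambda_1} H + \frac{\lambda_1}{1+\lambda_1} \sum_r \mu_r \hat{A}^{(r)} Z$ with fixed coefficients $\mu$ that sum to one; the function names, the random toy graph, and the chosen $\lambda_1$ are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def normalized_adjacency(adj):
    """Symmetric normalization with self-loops: D~^{-1/2} (A + I) D~^{-1/2}."""
    a_tilde = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a_tilde.sum(axis=1))
    return a_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def propagate(H, A_hats, mu, lam1, num_iters):
    """Iterate Z <- H / (1 + lam1) + lam1 / (1 + lam1) * sum_r mu_r A_hat_r Z."""
    A_mix = sum(m * A for m, A in zip(mu, A_hats))   # ensemble propagation matrix
    Z = H.copy()
    for _ in range(num_iters):
        Z = H / (1.0 + lam1) + (lam1 / (1.0 + lam1)) * (A_mix @ Z)
    return Z

def closed_form(H, A_hats, mu, lam1):
    """Solve (I + lam1 * sum_r mu_r L_r) Z = H, with sum_r mu_r L_r = I - sum_r mu_r A_hat_r."""
    n = H.shape[0]
    L_mix = np.eye(n) - sum(m * A for m, A in zip(mu, A_hats))
    return np.linalg.solve(np.eye(n) + lam1 * L_mix, H)

# Hypothetical toy data: 6 nodes, 2 relations, 3 feature dimensions.
rng = np.random.default_rng(0)
n = 6
adjs = []
for _ in range(2):
    upper = np.triu(rng.integers(0, 2, size=(n, n)), k=1).astype(float)
    adjs.append(upper + upper.T)                     # symmetric, no self-loops
A_hats = [normalized_adjacency(a) for a in adjs]
H = rng.normal(size=(n, 3))
mu = np.array([0.7, 0.3])

Z_iter = propagate(H, A_hats, mu, lam1=4.0, num_iters=200)
Z_exact = closed_form(H, A_hats, mu, lam1=4.0)
print(np.max(np.abs(Z_iter - Z_exact)))              # gap between the two solutions
```

The iteration contracts by a factor of at most $\frac{\lambda_1}{1+\lambda_1}$ per step, so a larger $\lambda_1$ needs more propagation steps to approach the closed-form solution.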

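Algorithm 1: Relational Coefficients Learning
Input: Candidate Laplacians $\{\tilde{L}^{(1)}, \dots, \tilde{L}^{(R)}\}$, the embedding matrix $Z$, the Lipschitz constant $\phi$, the tradeoff parameters $\lambda_1, \lambda_2$.
Output: Relational coefficients $\mu$.
Initialization: $\mu^{0} = [\frac{1}{R}, \frac{1}{R}, \dots, \frac{1}{R}]$
for $r = 1$ to $R$ do
   $s_r = \mathrm{tr}( Z^\top \tilde{L}^{(r)} Z )$
end for
repeat
   $T_t = \sqrt{\frac{2 \ln R}{t \phi^2}}$,
   $f'(\mu_r^{t}) = \frac{2 \lambda_2}{\lambda_1} \mu_r^{t} + s_r$ for $r = 1, \dots, R$,
   $\mu_r^{t+1} = \frac{\mu_r^{t} \exp( -T_t f'(\mu_r^{t}) )}{\sum_{r'=1}^{R} \mu_{r'}^{t} \exp( -T_t f'(\mu_{r'}^{t}) )}$ for $r = 1, \dots, R$,
until convergence
return $\mu$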
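The coefficient update can be sketched with the standard entropic mirror descent step of [Beck and Teboulle2003]: a multiplicative update followed by renormalization onto the simplex. The snippet below is a minimal NumPy sketch, assuming the objective $f(\mu) = \sum_r \mu_r s_r + \frac{\lambda_2}{\lambda_1} \|\mu\|_2^2$ and the Lipschitz bound $\phi = \frac{2\lambda_2}{\lambda_1} + \|s\|_1$; the function name, the fixed step count used in place of a convergence test, and the example smoothness values are hypothetical.

```python
import numpy as np

def learn_relation_coefficients(s, lam1, lam2, num_steps=200):
    """Entropic mirror descent on the simplex for f(mu) = s.mu + (lam2/lam1) * ||mu||^2."""
    s = np.asarray(s, dtype=float)
    R = s.size
    c = lam2 / lam1
    phi = 2.0 * c + np.abs(s).sum()                  # Lipschitz bound on the gradient
    mu = np.full(R, 1.0 / R)                         # start from uniform coefficients
    for t in range(1, num_steps + 1):
        step = np.sqrt(2.0 * np.log(R) / t) / phi    # decaying step size
        grad = s + 2.0 * c * mu                      # gradient of f at the current mu
        w = mu * np.exp(-step * grad)                # multiplicative update
        mu = w / w.sum()                             # renormalize onto the simplex
    return mu

# Hypothetical smoothness values s_r = tr(Z^T L_r Z) for three relations.
s = [0.5, 2.0, 1.0]
mu = learn_relation_coefficients(s, lam1=1.0, lam2=1.0)
print(mu)  # the relation with the smallest s_r receives the largest coefficient
```

A relation under which the current representation is smoother (smaller $s_r$) gets a larger weight, while a very large $\lambda_2 / \lambda_1$ keeps the coefficients close to uniform.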

Ensemble Message Passing Layer (EnMP layer)

Now, with the node representation $Z$ and the relational coefficients $\mu$, we can propose our ensemble message passing layer, consisting of the following two steps: (1) relational coefficient learning step (RCL step), i.e., update the relational coefficients according to Algorithm 1; (2) propagation step (Pro step), i.e., update the node representation according to Eq.(10). The pseudocode of the EnMP layer is shown in appendix A. We next show some properties of the proposed EnMP layer.

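Remark 1 (Relationship with Multi-Relational/Path Personalized PageRank).

Given a relation $r$, we have the relation based probability transition matrix $A^{(r)}_{rw} = A^{(r)} (D^{(r)})^{-1}$. Then, the single relation based PageRank matrix is calculated via:

$$\Pi^{(r)}_{pr} = A^{(r)}_{rw} \Pi^{(r)}_{pr}.$$

Considering we have $R$ relations, i.e., $r = 1, 2, \dots, R$, and the weights of the relations are $\{\mu_1, \dots, \mu_R\}$, according to [Lee et al.2013, Ji et al.2021a] we can define the multiple relations based PageRank matrix:

$$\Pi_{pr} = \left( \sum_{r=1}^{R} \mu_r A^{(r)}_{rw} \right) \Pi_{pr}.$$

According to [Klicpera et al.2018], the multi-relational personalized PageRank matrix can be defined:

$$\Pi_{ppr} = \alpha \left( I_n - (1 - \alpha) \sum_{r=1}^{R} \mu_r \hat{A}^{(r)} \right)^{-1},$$

where $\hat{A}^{(r)}$ is a normalized adjacency matrix with self-loops, $I_n$ represents the unit matrix, and $\alpha \in (0, 1]$ is the teleport (or restart) probability. If $\alpha = \frac{1}{1 + \lambda_1}$, the closed-form solution in Eq.(9) is to propagate features via the multi-relational personalized PageRank scheme.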
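The substitution behind this remark can be written out step by step. The derivation below is a sketch that assumes the closed-form solution has the form $Z = ( I + \lambda_1 \sum_r \mu_r \tilde{L}^{(r)} )^{-1} H$, with $\tilde{L}^{(r)} = I - \hat{A}^{(r)}$ the normalized Laplacian of relation $r$ and $\sum_r \mu_r = 1$:

```latex
\begin{aligned}
Z &= \Big( I + \lambda_1 \sum_{r} \mu_r \big( I - \hat{A}^{(r)} \big) \Big)^{-1} H \\
  &= \Big( (1 + \lambda_1) I - \lambda_1 \sum_{r} \mu_r \hat{A}^{(r)} \Big)^{-1} H \\
  &= \frac{1}{1 + \lambda_1} \Big( I - \frac{\lambda_1}{1 + \lambda_1} \sum_{r} \mu_r \hat{A}^{(r)} \Big)^{-1} H \\
  &= \alpha \Big( I - (1 - \alpha) \sum_{r} \mu_r \hat{A}^{(r)} \Big)^{-1} H,
  \qquad \alpha = \frac{1}{1 + \lambda_1}.
\end{aligned}
```

The second line uses $\sum_r \mu_r = 1$ to collect the identity terms, and the last line is exactly a personalized PageRank matrix with teleport probability $\alpha$ applied to $H$.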

Remark 2 (Relationship with APPNP/GCN).

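If $\lambda_2 = +\infty$, the solution of Eq.(5) is $\mu_r = \frac{1}{R}$, i.e., each relation is assigned an equal coefficient, and the ensemble multi-relational graph reduces to a normalized adjacency matrix averaged over all relations, $\frac{1}{R} \sum_{r=1}^{R} \hat{A}^{(r)}$. With $H$ the input feature and $Z^{(k)}$ the representation at iteration $k$, the proposed message passing scheme reduces to:

$$Z^{(k+1)} = \frac{1}{1 + \lambda_1} H + \frac{\lambda_1}{1 + \lambda_1} \cdot \frac{1}{R} \sum_{r=1}^{R} \hat{A}^{(r)} Z^{(k)}.$$

If $\lambda_1 = \frac{1}{\alpha} - 1$, it recovers the message passing in APPNP on the averaged relational graph:

$$Z^{(k+1)} = \alpha H + (1 - \alpha) \frac{1}{R} \sum_{r=1}^{R} \hat{A}^{(r)} Z^{(k)}.$$

If $\lambda_1 = +\infty$, it recovers the message passing in GCN on the averaged relational graph:

$$Z^{(k+1)} = \frac{1}{R} \sum_{r=1}^{R} \hat{A}^{(r)} Z^{(k)}.$$

Figure 1: Model architecture.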

3.3 Ensemble Multi-Relational GNNs

Now we propose our ensemble multi-relational graph neural networks (EMR-GNN) with the EnMP layer. Similar to [Klicpera et al.2018], we employ a decoupled architecture, i.e., the feature transformation and the message passing layer are separated. The overall framework is shown in Figure 1, and the forward propagation process is as follows:

$$Y_{pre} = g_{\theta}\left( \mathrm{EnMP}^{(K)}\left( f(X; W), R, \lambda_1, \lambda_2 \right) \right),$$

where $X$ is the input feature of nodes, and $f(X; W)$ denotes the MLPs or linear layers (parameterized by $W$) used for feature extraction. $\mathrm{EnMP}^{(K)}$ represents our designed ensemble relational message passing layer with $K$ layers, where $R$ is the number of relations, and $\lambda_1$, $\lambda_2$ are hyperparameters in our message passing layer. $g_{\theta}(\cdot)$ is an MLP classifier with learnable parameters $\theta$. The training loss is $\ell(W, \theta) = \sum_{i \in \mathcal{V}_L} \mathcal{D}(\hat{y}_i, y_i)$, where $\mathcal{D}$ is a discriminator function of cross-entropy, $\mathcal{V}_L$ is the set of labeled training nodes, and $\hat{y}_i$ and $y_i$ are the predicted and ground-truth labels of node $i$, respectively. Backpropagation is used to optimize the parameters of the MLPs, i.e., $W$ and $\theta$, while the parameters in our EnMP layers are optimized during the forward propagation. We can see that EMR-GNN is built on a clear optimization objective. Besides, EMR-GNN also has the following two advantages:

  • As analyzed by Remark 1, our proposed EnMP can keep the original information of the nodes with a teleport (or restart) probability, thereby alleviating over-smoothing.

  • In the traditional RGCN [Vashishth et al.2019, Schlichtkrull et al.2018], each relation has a parameterized relation-specific weight matrix or a parameterized relation encoder. In our EnMP, by contrast, only one learnable weight coefficient is associated with each relation, greatly alleviating the over-parameterization problem.
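A rough count illustrates the second advantage. The sketch below assumes an RGCN layer without basis decomposition, i.e., one square weight matrix per relation plus one for the self-connection, and it ignores biases and the MLPs that EMR-GNN uses for feature transformation and classification. The embedding dimension of 64 is a hypothetical choice; 103 is the number of edge types reported for BGS.

```python
def rgcn_layer_params(num_relations: int, dim: int) -> int:
    """Weights in one RGCN layer without basis decomposition (assumption:
    one dim x dim matrix per relation plus one for the self-connection)."""
    return (num_relations + 1) * dim * dim

def enmp_relation_params(num_relations: int) -> int:
    """Relation-specific quantities in an EnMP layer: one coefficient per relation."""
    return num_relations

num_relations, dim = 103, 64   # BGS edge types; hypothetical embedding dimension
print(rgcn_layer_params(num_relations, dim))   # relation-specific weights per RGCN layer
print(enmp_relation_params(num_relations))     # relational coefficients in EnMP
```

Under these assumptions the relation-specific cost of RGCN grows with both the number of relations and the square of the embedding dimension, whereas the relational coefficients grow only with the number of relations.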

| Datasets | Nodes | Node Types | Edges | Edge Types | Target | Classes |
|---|---|---|---|---|---|---|
| MUTAG | 23,644 | 1 | 74,227 | 23 | molecule | 2 |
| BGS | 333,845 | 1 | 916,199 | 103 | rock | 2 |
| DBLP | 26,128 | 4 | 239,566 | 6 | author | 4 |
| ACM | 10,942 | 4 | 547,872 | 8 | paper | 3 |

Table 1: Statistics of multi-relational datasets.

| Method | DBLP Acc (%) | DBLP Recall (%) | ACM Acc (%) | ACM Recall (%) | MUTAG Acc (%) | MUTAG Recall (%) | BGS Acc (%) | BGS Recall (%) |
|---|---|---|---|---|---|---|---|---|
| GCN | 90.39±0.38 | 89.49±0.52 | 89.58±1.47 | 89.47±1.49 | 72.35±2.17 | 63.28±2.95 | 85.86±1.96 | 80.21±2.21 |
| GAT | 91.97±0.40 | 91.25±0.58 | 88.99±1.58 | 88.89±1.56 | 70.74±2.13 | 63.01±3.79 | 88.97±3.17 | 86.13±4.96 |
| HAN | 91.73±0.61 | 91.15±0.72 | 88.51±0.35 | 88.50±0.30 | - | - | - | - |
| RGCN | 90.08±0.60 | 88.56±0.76 | 89.79±0.62 | 89.71±0.59 | 71.32±2.11 | 61.97±3.52 | 85.17±5.87 | 81.58±7.94 |
| e-RGCN | 91.77±0.90 | 91.18±1.02 | 83.00±1.04 | 84.03±0.75 | 69.41±2.84 | 67.57±8.04 | 82.41±1.96 | 84.51±3.38 |
| EMR-GNN | 93.54±0.50 | 92.39±0.78 | 90.87±0.11 | 90.84±0.13 | 74.26±0.78 | 64.19±1.08 | 89.31±4.12 | 86.39±5.33 |

Table 2: The mean and standard deviation of classification accuracy and recall over 10 different runs on four datasets.

Table 3: Comparison of the number of parameters, expressed in terms of the number of layers in the model, the embedding dimension, the number of bases, the total number of relations in the graph, and the number of heads of attention-based models.

Figure 2: Visualization of the learned node embeddings on DBLP dataset. Panels: (a) GCN (SC=0.6196); (b) GAT (SC=0.6351); (c) RGCN (SC=0.6093); (d) EMR-GNN (SC=0.7799).

Figure 3: Analysis of propagation layers on (a) DBLP and (b) ACM. Missing value in red line means CUDA is out of memory.

Figure 4: Accuracy under each single relation and corresponding relational coefficient.

4 Experiment

4.1 Experimental Settings


Datasets.

The following four real-world heterogeneous datasets from various fields are used. They can be divided into two categories: i) both the node type and the edge type are heterogeneous (DBLP [Fu et al.2020], ACM [Lv et al.2021]); ii) the node type is homogeneous but the edge type is heterogeneous (MUTAG [Schlichtkrull et al.2018], BGS [Schlichtkrull et al.2018]). The statistics of the datasets can be found in Table 1. The basic information about datasets is summarized in appendix B.1.


Baselines.

To test the performance of the proposed EMR-GNN, we compare it with five state-of-the-art baselines. Two popular approaches, GCN [Kipf and Welling2016] and GAT [Veličković et al.2017], are included. In addition, we compare with the heterogeneous graph model HAN [Wang et al.2019], since HAN can also employ multiple relations. Finally, we compare with two models that are specially designed for multi-relational graphs, i.e., RGCN [Schlichtkrull et al.2018] and e-RGCN [Thanapalasingam et al.2021].

Parameter settings.

We implement EMR-GNN based on PyTorch. For the feature extractor $f(\cdot)$ and the classifier $g_{\theta}(\cdot)$, we choose a one-layer MLP for DBLP and ACM, and linear layers for MUTAG and BGS. We conduct 10 runs on all datasets with a fixed training/validation/test split for all experiments. More implementation details can be found in appendix B.3.

4.2 Node Classification Results

Table 2 summarizes the performance of EMR-GNN and several baselines on the semi-supervised node classification task. Since HAN's code uses the heterogeneity of nodes to design meta-paths, we do not reproduce the results of HAN on the homogeneous datasets (MUTAG, BGS), which have only one type of node. We use accuracy (Acc) and recall for evaluation, and report the mean and standard deviation of both. We have the following observations: (1) Compared with all baselines, the proposed EMR-GNN achieves the best performance on seven of the eight metrics across the four datasets, which demonstrates the effectiveness of our proposed model. e-RGCN has a higher recall but a lower accuracy on MUTAG, which may be caused by overfitting. (2) Meanwhile, the numbers of parameters of our model and the baselines are shown in Table 3. We can see that EMR-GNN is more parameter efficient than all baselines, yet it still improves over RGCN, with the largest accuracy gain on BGS (89.31 vs. 85.17 in Table 2). This indicates that EMR-GNN largely overcomes the over-parameterization of previous multi-relational GNNs.

4.3 Model Analysis

Alleviating over-smoothing problem.

As mentioned before, EMR-GNN is able to alleviate the over-smoothing issue. Here, we take one typical single-relation GNN (GAT) and one representative multi-relational GNN (RGCN) as baselines and test their performance at different propagation depths; the results are shown in Figure 3. We have the following observations: (1) Our model significantly alleviates the over-smoothing problem, since there is generally no performance degradation when the depth increases. This benefits from the adjustable factor in EMR-GNN, which flexibly controls the influence of node feature information. In contrast, the performance of RGCN and GAT drops seriously with increasing depth, implying that these models suffer from the over-smoothing problem. (2) RGCN has a huge storage cost, making it difficult to stack multiple layers. CUDA runs out of memory as the propagation depth increases: on DBLP beyond 16 layers, while on ACM only 8 layers can be stacked. This prevents RGCN from capturing long-range dependencies.

| Datasets | Method | Size of training set | | | |
|---|---|---|---|---|---|
| DBLP | RGCN | 0.5381 | 0.6388 | 0.7515 | 0.7721 |
| DBLP | EMR-GNN | 0.8768 | 0.9109 | 0.9128 | 0.9364 |
| ACM | RGCN | 0.7492 | 0.8136 | 0.8278 | 0.8344 |
| ACM | EMR-GNN | 0.8489 | 0.8654 | 0.8739 | 0.8753 |

Table 4: Classification accuracy w.r.t. different training set.

Alleviating over-parameterization problem.

To further illustrate the advantages of alleviating over-parameterization, we verify EMR-GNN with small-scale training samples. We conduct experiments on EMR-GNN and RGCN using two datasets. We only select a small part of nodes from original training samples as the new training samples. As can be seen in Table 4, EMR-GNN consistently outperforms RGCN with different training sample ratios, which again validates the superiority of the proposed method. One reason is that a limited number of parameters in EMR-GNN can be fully trained with few samples. In contrast, RGCN with excess parameters requires large-scale training samples as the number of relations increases. The time complexity is analyzed in appendix C.

Analysis of relational coefficients.

Besides the performance, we further show that EMR-GNN can produce reasonable relational coefficients. To verify its ability to learn relational coefficients, we take the ACM dataset as an example and evaluate the classification performance under each single relation. The classification accuracy and the corresponding relational coefficient value are reported in Figure 4. We can see that, in general, a relation that achieves better accuracy is associated with a larger coefficient. Moreover, we compute the Pearson correlation coefficient between the accuracy of a single relation and its relational coefficient, which is 0.7918, demonstrating that they are positively correlated.
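For readers who want to repeat this kind of check on their own runs, the Pearson correlation can be computed as below. This is a minimal NumPy sketch; the accuracy and coefficient values are hypothetical placeholders, not the numbers behind Figure 4.

```python
import numpy as np

def pearson(x, y) -> float:
    """Pearson correlation: covariance divided by the product of standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))

# Hypothetical per-relation accuracies and learned relational coefficients.
single_relation_acc = [0.62, 0.71, 0.55, 0.80]
relation_coef = [0.20, 0.30, 0.10, 0.40]

r = pearson(single_relation_acc, relation_coef)
print(r)  # a value close to 1 indicates a strong positive linear relationship
```

The same value is returned by `np.corrcoef(x, y)[0, 1]`; the explicit version is shown only to make the definition visible.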


Visualization.

For a more intuitive comparison, we conduct a visualization task on the DBLP dataset. We plot the output embedding of the last layer of EMR-GNN and three baselines (GCN, GAT and RGCN) using t-SNE [Van der Maaten and Hinton2008]. All nodes in Figure 2 are colored by their ground-truth labels. It can be observed that EMR-GNN performs best, since there are clear boundaries between nodes of different colors and nodes of the same color are relatively densely distributed. In contrast, the nodes with different labels are mixed together for GCN and RGCN. In addition, we calculate the silhouette coefficients (SC) of the classification results of different models, and EMR-GNN achieves the best score, further indicating that the learned representations of EMR-GNN have a clearer structure.

5 Conclusion

In this work, we study how to design multi-relational graph neural networks from the perspective of an optimization objective. We propose an ensemble optimization framework and derive a novel ensemble message passing layer. We then present the ensemble multi-relational GNN (EMR-GNN), which is closely related to multi-relational/path personalized PageRank and can recover some popular GNNs. EMR-GNN is not only designed with a clear objective function, but can also alleviate the over-smoothing and over-parameterization issues. Extensive experiments demonstrate the superior performance of EMR-GNN over several state-of-the-art methods.


The research was supported in part by the National Natural Science Foundation of China (Nos. 61802025, 61872836, U1936104) and Meituan.