GNNs are a class of neural networks that perform neural network operations over a graph structure to learn node representations. Owing to their ability to learn more comprehensive node representations than models that consider only node features or only graph structure, GNNs have become a promising solution for a wide range of applications in social science [14], recommendation [24, 13], etc. To date, most graph convolution operations in GNNs are implemented as a linear aggregation (i.e., a weighted sum) over the features of the neighbors of the target node. Although it improves the representation of the target node, such linear aggregation assumes that the neighbor nodes are independent of each other, ignoring the possible interactions between them.
Under some circumstances, the interactions between neighbor nodes could be a strong signal that indicates the characteristics of the target node. Figure 1
(left) illustrates a toy example of a target node and its neighbors in a transaction graph, where edges denote money transfer relations and nodes are described by a set of features such as age and income. The interaction between nodes 1 and 2, which indicates that both have high incomes, could be a strong signal for estimating the credit rating of the target node (an intuition is that a customer who has close business relations with rich friends is more likely to repay a loan). Explicitly modeling such interactions between neighbors highlights the common properties within the local structure, which could be rather helpful for the target node's representation. In Figure 1 (right), we show that the summation-based linear aggregator — a common choice in existing GNNs — fails to highlight the income feature. In contrast, by using a multiplication-based aggregator that captures node interactions, the signal latent in the shared high incomes is highlighted, and as an auxiliary effect, some less useful features are zeroed out.
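To make the figure's intuition concrete, here is a minimal NumPy sketch contrasting the two aggregation styles on made-up feature vectors (the values and feature names are ours, not taken from Figure 1):

```python
import numpy as np

# Hypothetical neighbor features: [age, income, debt] (made-up values).
h1 = np.array([0.3, 0.9, 0.0])  # neighbor 1: high income, no debt
h2 = np.array([0.7, 0.8, 0.0])  # neighbor 2: high income, no debt

# Summation-based linear aggregation: averages every feature, so the
# shared high-income signal is not singled out.
linear = (h1 + h2) / 2          # -> [0.5, 0.85, 0.0]

# Multiplication-based aggregation: features that are large in BOTH
# neighbors stay large; features near zero in either neighbor vanish.
bilinear = h1 * h2              # -> [0.21, 0.72, 0.0]
```

Relative to the other features, the shared income signal stands out more strongly under the multiplicative aggregator (0.72 vs. 0.21) than under the average (0.85 vs. 0.5), and the zero-valued feature is zeroed out.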
Nevertheless, it is non-trivial to encode such local node interactions in a GNN. The difficulty mainly comes from two indispensable requirements of a feasible graph convolution operation: 1) permutation invariance, i.e., the output should remain the same when the order of the neighbor nodes is changed, so as to ensure the stability of a GNN; and 2) linear complexity, i.e., the computational complexity should increase linearly with the number of neighbors, so as to make a GNN scalable on large graphs. To this end, we take inspiration from neural factorization machines to devise a new bilinear aggregator, which explicitly expresses the interaction between every pair of nodes and aggregates all pairwise interactions to enhance the target node's representation.
On this basis, we develop a new graph convolution operator that is equipped with both the traditional linear aggregator and the newly proposed bilinear aggregator, and is proved to be permutation invariant. We name the new model Bilinear Graph Neural Network (BGNN), which is expected to learn more comprehensive representations by considering local node interactions. We devise two BGNN models, named BGCN and BGAT, which are equipped with the GCN and GAT linear aggregators, respectively. Taking semi-supervised node classification as an example task, we evaluate BGCN and BGAT on three benchmark datasets to validate their effectiveness. Specifically, BGCN and BGAT outperform GCN and GAT by 1.6% and 1.5%, respectively. More fine-grained analyses show that the improvements on sparsely connected nodes are more significant, demonstrating the strengths of the bilinear aggregator in modeling node interactions. The main contributions of this paper are summarized as:
We propose BGNN, a simple yet effective GNN framework, which explicitly encodes local node interactions to augment the conventional linear aggregator.
We prove that the proposed BGNN model is permutation invariant and has linear computational complexity, two properties of importance for GNN models.
We conduct extensive experiments on three public benchmarks of semi-supervised node classification, validating the effectiveness of the proposed BGNN models.
2 Related Work
GNN generalizes traditional convolutional neural networks from Euclidean space to the graph domain. According to the format of the convolution operations, existing GNN models can be divided into two categories: spatial GNN and spectral GNN. We review the two kinds of models separately, and refer readers to the literature for the mathematical connection between them.
Spectral GNNs define the convolution over the eigenvectors of the graph Laplacian, which are viewed as the Fourier basis. Considering the high computational cost of the eigen-decomposition, research on spectral GNNs has focused on approximating the decomposition with different mathematical techniques [7, 15, 17, 28]. For instance, one line of work introduces Chebyshev polynomials up to a given order to approximate the eigen-decomposition. Kipf and Welling simplify this model by limiting the polynomial order to 1 and approximating the largest eigenvalue of the Laplacian matrix by 2. In addition, Liao et al. employ the Lanczos algorithm to perform a low-rank approximation of the graph Laplacian. Recently, the wavelet transform has been introduced to spectral GNNs to avoid the eigen-decomposition. However, spectral GNN models are hard to apply to large graphs such as social networks. This is because the convolution operations must be performed over the whole graph, posing unaffordable memory costs and precluding the widely used batch training.
Spatial GNNs instead perform convolution operations directly over the graph structure by aggregating the features of spatially close neighbors into a target node [1, 10, 15, 22, 30, 27, 23, 29, 8]. This line of research mainly focuses on developing aggregation methods from different perspectives. For instance, Kipf and Welling propose a linear aggregator (i.e., a weighted sum) that uses the inverse of the node degree as the coefficient. To improve representation performance, a neural attention mechanism has been introduced to learn the coefficients. In addition to aggregating information from directly connected neighbors, augmented aggregators also account for multi-hop neighbors [1, 30]. Moreover, non-linear aggregators are also employed in spatial GNNs, such as max-pooling, capsule, and Long Short-Term Memory (LSTM) aggregators. Furthermore, spatial GNNs have been extended to graphs with both static and temporal neighbor structures, and to representations in hyperbolic space.
However, most existing aggregators (both linear and non-linear) overlook the interactions among neighbors. Built upon the summation operation, linear aggregators by nature assume that neighbors are independent. Most of the non-linear ones focus on properties of the neighbors at the set level (i.e., all neighbors), e.g., the "skeleton" of the neighbors. Taking one neighbor as the input of each time step, an LSTM-based aggregator could capture sequential dependencies, which might include node interactions; however, it requires a predefined order over the neighbors, violating permutation invariance and typically showing weak performance. Our work differs from those aggregators in that we explicitly consider pairwise node interactions in a neat and systematic way.
3 Bilinear Graph Neural Network
Let $\mathcal{G} = (\mathbf{A}, \mathbf{X})$ be the graph of interest, where $\mathbf{A} \in \{0,1\}^{N \times N}$ is the binary adjacency matrix whose element $A_{vi} = 1$ means that an edge exists between node $v$ and node $i$, and $\mathbf{X} \in \mathbb{R}^{N \times F}$ is the original feature matrix for the nodes, which describes each node with a vector of size $F$ (a row). We denote the neighbors of node $v$ as $\mathcal{N}(v)$, which stores all nodes that have an edge with $v$, and the extended neighbors of node $v$ as $\tilde{\mathcal{N}}(v) = \mathcal{N}(v) \cup \{v\}$, which additionally contains the node itself. For convenience, we use $d_v$ to denote the degree of node $v$, i.e., $d_v = |\mathcal{N}(v)|$, and accordingly $\tilde{d}_v = d_v + 1$. The model objective is to learn a representation vector $\mathbf{h}_v$ for each node $v$, such that its characteristics are properly encoded. For example, the label of node $v$ can be directly predicted as a function output $\hat{y}_v = f(\mathbf{h}_v)$, without the need of looking into the graph structure and the original node features in $\mathbf{X}$.
The spatial GNN achieves this goal by recursively aggregating the features from neighbors:

$\mathbf{h}_v^{(k)} = \mathrm{AGG}\big(\{\mathbf{h}_i^{(k-1)}\}_{i \in \tilde{\mathcal{N}}(v)};\ \mathbf{W}^{(k)}\big), \quad (1)$

where $\mathbf{h}_v^{(k)}$ denotes the representation of the target node $v$ at the $k$-th layer/iteration, $\mathbf{W}^{(k)}$ is the weight matrix (a model parameter) that performs feature transformation at the $k$-th layer, and the initial representation $\mathbf{h}_v^{(0)}$ can be obtained from the original feature matrix $\mathbf{X}$.
The $\mathrm{AGG}$ function is typically implemented as a weighted sum with $a_{vi}$ as the weight of neighbor $i$. In GCN, $a_{vi}$ is defined as $1/\sqrt{\tilde{d}_v \tilde{d}_i}$, which is grounded in Laplacian theories. The recent advance of the graph attention network (GAT) learns $a_{vi}$ from data, which has the potential to lead to better performance than pre-defined choices. However, a limitation of such a weighted sum is that no interactions between neighbor representations are modeled. Although using a more powerful feature transformation function such as a multi-layer perceptron (MLP) can alleviate this problem, the process is rather implicit and ineffective; prior empirical evidence shows that an MLP is inefficient in capturing the multiplicative relations between input features. In this work, we propose to explicitly inject multiplication-based node interactions into the $\mathrm{AGG}$ function.
3.1 Bilinear Aggregator
As demonstrated in Figure 1, the multiplication between two vectors is an effective way to model interactions — emphasizing common properties and diluting discrepant information. Inspired by factorization machines (FMs) [20, 12], which have been extensively used to learn the interactions among categorical variables, we propose a bilinear aggregator that is suitable for modeling the neighbor interactions in a local structure:
$\mathrm{BA}\big(\{\mathbf{h}_i\}_{i\in\tilde{\mathcal{N}}(v)}\big) = \frac{1}{b_v} \sum_{i\in\tilde{\mathcal{N}}(v)} \sum_{j\in\tilde{\mathcal{N}}(v),\ j>i} (\mathbf{h}_i\mathbf{W}) \odot (\mathbf{h}_j\mathbf{W}), \quad (2)$

where $\odot$ denotes the element-wise product; $v$ is the target node to obtain the representation for; $i$ and $j$ are node indices from the extended neighbors $\tilde{\mathcal{N}}(v)$ — they are constrained to be different to avoid self-interactions, which are meaningless and may even introduce extra noise. $b_v = \frac{1}{2}\tilde{d}_v(\tilde{d}_v - 1)$ denotes the number of interactions for the target node $v$, which normalizes the obtained representation to remove the bias of node degree. It is worth noting that we take the target node itself into account and aggregate information from the extended neighbors; although this looks the same as in GNN, it is for a different reason. In GNN, accounting for the target node retains its information during layer-wise aggregation, working like residual learning. In BGNN, our consideration is that the interactions between the target node and its neighbors may also carry a useful signal. For example, for sparse nodes that have only one neighbor, no interaction between neighbors exists, and the interaction between the target and the neighbor node can be particularly helpful.
Time Complexity Analysis.
At first sight, the bilinear aggregator considers all pairwise interactions between neighbors (including the target node) and thus may have a quadratic time complexity w.r.t. the neighbor count, higher than that of the weighted sum. Nevertheless, through a mathematical reformulation similar to the one used in FM, we can compute the aggregator in linear time $O(\tilde{d}_v)$, the same complexity as the weighted sum. To show this, we rewrite Equation (2) in its equivalent form:

$\mathrm{BA}\big(\{\mathbf{h}_i\}_{i\in\tilde{\mathcal{N}}(v)}\big) = \frac{1}{2 b_v} \Big( \big( \sum_{i\in\tilde{\mathcal{N}}(v)} \mathbf{h}_i\mathbf{W} \big)^{2} - \sum_{i\in\tilde{\mathcal{N}}(v)} (\mathbf{h}_i\mathbf{W})^{2} \Big), \quad (3)$

where $(\cdot)^2$ denotes the element-wise square. As can be seen, through this reformulation, we reduce the sum over pairwise element-wise products to the difference of two terms, where each term is a sum of the transformed neighbor representations (or their squares) and can be computed in $O(\tilde{d}_v)$ time. Note that multiplying by the weight matrix $\mathbf{W}$ is a standard operation in aggregators, so its time cost is omitted for brevity.
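The reformulation can be verified numerically. The sketch below (our own helper functions, assuming a dense matrix H_nb whose rows are the extended neighbors' representations) compares the naive pairwise computation with the linear-time form:

```python
import numpy as np

def ba_naive(H_nb, W):
    """Bilinear aggregator via explicit pairs: quadratic in neighbor count."""
    Z = H_nb @ W                       # transform each neighbor representation
    n = Z.shape[0]
    b = n * (n - 1) / 2                # number of distinct pairs
    out = np.zeros(Z.shape[1])
    for i in range(n):
        for j in range(i + 1, n):
            out += Z[i] * Z[j]         # element-wise pairwise interaction
    return out / b

def ba_linear(H_nb, W):
    """Same aggregator via the FM-style identity: linear in neighbor count."""
    Z = H_nb @ W
    n = Z.shape[0]
    b = n * (n - 1) / 2
    # sum_{i<j} z_i * z_j = 0.5 * ((sum_i z_i)^2 - sum_i z_i^2), element-wise
    return 0.5 * (Z.sum(axis=0) ** 2 - (Z ** 2).sum(axis=0)) / b

rng = np.random.default_rng(0)
H_nb = rng.normal(size=(5, 8))         # 5 extended neighbors, 8-dim features
W = rng.normal(size=(8, 4))
assert np.allclose(ba_naive(H_nb, W), ba_linear(H_nb, W))
```

The two functions agree up to floating-point error, while `ba_linear` touches each neighbor only once.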
Proof of Permutation Invariant.
This property is intuitive to understand from the reduced Equation (3): when the order of the input vectors changes, neither the sum of the inputs (the first term) nor the sum of their squares (the second term) changes. Thus the output is unchanged and the permutation invariance property is satisfied. To provide a rigorous proof, we give the matrix form of the bilinear aggregator, which also facilitates the matrix-wise implementation of BGNN:

$\mathrm{BA}(\mathbf{H}, \mathbf{A}) = \frac{1}{2}\mathbf{B}^{-1}\big( (\tilde{\mathbf{A}}\mathbf{H}\mathbf{W})^{2} - \tilde{\mathbf{A}}(\mathbf{H}\mathbf{W})^{2} \big), \quad (4)$

where $\mathbf{H}$ stores the representation vectors $\mathbf{h}_v$ for all nodes; $\tilde{\mathbf{A}} = \mathbf{A} + \mathbf{I}$ is the adjacency matrix of the graph with a self-loop added on each node ($\mathbf{I}$ is an identity matrix); $\mathbf{B}$ is a diagonal matrix with each diagonal element $B_{vv} = b_v$; and $(\cdot)^2$ denotes the element-wise product of a matrix with itself.
Let $\mathbf{P}$ be any permutation matrix, which satisfies (1) $\mathbf{P}^\top\mathbf{P} = \mathbf{I}$, and (2) for any matrices $\mathbf{M}_1$ and $\mathbf{M}_2$ of the same shape, $\mathbf{P}\mathbf{M}_1 \odot \mathbf{P}\mathbf{M}_2 = \mathbf{P}(\mathbf{M}_1 \odot \mathbf{M}_2)$. When we apply the permutation $\mathbf{P}$ to the nodes, $\mathbf{H}$ changes to $\mathbf{P}\mathbf{H}$, $\tilde{\mathbf{A}}$ changes to $\mathbf{P}\tilde{\mathbf{A}}\mathbf{P}^\top$, and $\mathbf{B}$ changes to $\mathbf{P}\mathbf{B}\mathbf{P}^\top$, which leads to:

$\mathrm{BA}(\mathbf{P}\mathbf{H}, \mathbf{P}\mathbf{A}\mathbf{P}^\top) = \frac{1}{2}(\mathbf{P}\mathbf{B}\mathbf{P}^\top)^{-1}\big( (\mathbf{P}\tilde{\mathbf{A}}\mathbf{H}\mathbf{W})^{2} - \mathbf{P}\tilde{\mathbf{A}}(\mathbf{H}\mathbf{W})^{2} \big) = \mathbf{P}\cdot\mathrm{BA}(\mathbf{H}, \mathbf{A}),$

which indicates the permutation invariance property.
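As a numerical sanity check, the matrix form and its permutation invariance can be verified on a small ring graph (the sketch below uses our own function names, not code from the paper):

```python
import numpy as np

def ba_matrix(H, A, W):
    """Matrix-form bilinear aggregator: 0.5 * B^{-1}((A~ H W)^2 - A~ (H W)^2)."""
    A_tilde = A + np.eye(A.shape[0])        # add self-loops
    d = A_tilde.sum(axis=1)                 # extended degrees
    b = d * (d - 1) / 2                     # per-node interaction counts
    Z = H @ W
    return 0.5 * ((A_tilde @ Z) ** 2 - A_tilde @ (Z ** 2)) / b[:, None]

# Ring graph over 6 nodes (every node has degree 2, so no division by zero).
N = 6
A = np.zeros((N, N))
idx = np.arange(N)
A[idx, (idx + 1) % N] = A[(idx + 1) % N, idx] = 1

rng = np.random.default_rng(1)
H = rng.normal(size=(N, 5))
W = rng.normal(size=(5, 3))
P = np.eye(N)[rng.permutation(N)]           # random permutation matrix

# Permuting the nodes permutes the output rows in exactly the same way.
assert np.allclose(ba_matrix(P @ H, P @ A @ P.T, W), P @ ba_matrix(H, A, W))
```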
3.2 BGNN Model
We now describe the proposed BGNN model. As the bilinear aggregator emphasizes node interactions and encodes a signal different from that of the weighted-sum aggregator, we combine the two to build a more expressive graph convolutional network. We adopt a simple linear combination scheme, defining a new graph convolution operator as:

$\mathbf{H}^{(k)} = \mathrm{BGNN}\big(\mathbf{H}^{(k-1)}, \mathbf{A}\big) = (1-\alpha)\cdot \mathrm{AGG}\big(\mathbf{H}^{(k-1)}, \mathbf{A}\big) + \alpha\cdot \mathrm{BA}\big(\mathbf{H}^{(k-1)}, \mathbf{A}\big),$

where $\mathbf{H}^{(k)}$ stores the node representations at the $k$-th layer (encoding $k$-hop neighbors), and $\alpha \in [0, 1]$ is a hyper-parameter that trades off the strengths of the traditional GNN aggregator and our proposed bilinear aggregator. Figure 2 illustrates the model framework.
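Assuming a GCN-style linear aggregator, one such convolution layer can be sketched as follows (our own NumPy helpers; a real implementation would use trainable weights and sparse matrices):

```python
import numpy as np

def linear_agg(H, A, W):
    """GCN-style weighted sum with 1/sqrt(d~_v d~_i) coefficients."""
    A_tilde = A + np.eye(A.shape[0])
    d = A_tilde.sum(axis=1)
    return (A_tilde / np.sqrt(np.outer(d, d))) @ H @ W

def bilinear_agg(H, A, W):
    """Bilinear aggregator in its linear-time matrix form (no isolated nodes)."""
    A_tilde = A + np.eye(A.shape[0])
    d = A_tilde.sum(axis=1)
    b = d * (d - 1) / 2
    Z = H @ W
    return 0.5 * ((A_tilde @ Z) ** 2 - A_tilde @ (Z ** 2)) / b[:, None]

def bgnn_layer(H, A, W, alpha):
    """One BGNN convolution: convex combination of the two aggregators."""
    return (1 - alpha) * linear_agg(H, A, W) + alpha * bilinear_agg(H, A, W)

# Toy usage on a 6-node ring graph with random features and weights.
N = 6
A = np.zeros((N, N))
idx = np.arange(N)
A[idx, (idx + 1) % N] = A[(idx + 1) % N, idx] = 1
rng = np.random.default_rng(0)
X = rng.normal(size=(N, 4))
W = rng.normal(size=(4, 2))
H1 = np.tanh(bgnn_layer(X, A, W, alpha=0.3))  # 1-layer output, shape (6, 2)
```

Setting `alpha=0.0` recovers the plain linear aggregation, and `alpha=1.0` uses only the bilinear aggregator.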
Since both $\mathrm{AGG}$ and $\mathrm{BA}$ are permutation invariant, it is trivial to see that this graph convolution operator is also permutation invariant. When $\alpha$ is set to 0, no node interaction is considered and BGNN degrades to a vanilla GNN; when $\alpha$ is set to 1, BGNN uses only the bilinear aggregator to process the information from the neighbors. Our empirical studies show that an intermediate value between 0 and 1 usually leads to better performance, verifying the efficacy of modeling node interactions, and that the optimal setting varies across datasets. A 1-layer BGNN model applies this operator once, $\mathbf{H}^{(1)} = \sigma\big(\mathrm{BGNN}(\mathbf{X}, \mathbf{A})\big)$, where $\sigma$ is a non-linear activation function. Similarly, we can devise a 2-layer BGNN model in the same recursive manner:

$\mathbf{H}^{(2)} = \sigma\Big(\mathrm{BGNN}\big(\sigma(\mathrm{BGNN}(\mathbf{X}, \mathbf{A})), \mathbf{A}\big)\Big).$
However, such a straightforward multi-layer extension involves unexpected higher-order interactions. In the two-layer case, the second-layer representation includes partial 4th-order interactions among the two-hop neighbors, which are hard to interpret and unreasonable. When extending BGNN to multiple layers, say $K$ layers, we still wish to capture pairwise interactions, but among the $K$-hop neighbors. To this end, instead of directly stacking the layers, we define the 2-layer model as:

$\mathbf{H}^{(2)} = (1-\alpha)\cdot \mathrm{AGG}\big(\mathrm{AGG}(\mathbf{X}, \mathbf{A}), \mathbf{A}\big) + \alpha\cdot\Big( (1-\beta)\cdot \mathrm{BA}(\mathbf{X}, \mathbf{A}) + \beta\cdot \mathrm{BA}\big(\mathbf{X}, \mathbf{A}^{(2)}\big) \Big),$

where $\mathbf{A}^{(2)} = \mathrm{bool}(\mathbf{A}\mathbf{A})$ stores the 2-hop connectivities of the graph, and $\mathrm{bool}(\cdot)$ is an entry-wise operation that transforms non-zero entries to 1. As such, a non-zero entry in $\mathbf{A}^{(2)}$ means the corresponding node can reach the other node within two hops. $\beta \in [0, 1]$ is a hyper-parameter that trades off the strengths of the bilinear interactions within the 1-hop and 2-hop neighbors.
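As a toy illustration of the bool(AA) construction (the 4-node path graph 0-1-2-3 below is our own example, not from the paper):

```python
import numpy as np

# Hypothetical 4-node path graph 0-1-2-3 (toy adjacency for illustration).
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

# bool(.) keeps only reachability: any non-zero entry of A @ A becomes 1.
A2 = ((A @ A) > 0).astype(float)

# Node 0 reaches node 2 in two hops (0 -> 1 -> 2), but not node 3.
# Note the diagonal is also non-zero: any node with at least one neighbor
# reaches itself in two hops (e.g., 0 -> 1 -> 0).
```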
Following the same principle, we define the $K$-layer BGNN as:

$\mathbf{H}^{(K)} = (1-\alpha)\cdot \mathrm{GNN}^{K}(\mathbf{X}, \mathbf{A}) + \alpha\cdot \sum_{k=1}^{K} \beta_k\cdot \mathrm{BA}\big(\mathbf{X}, \mathbf{A}^{(k)}\big),$

where $\mathbf{A}^{(k)}$ denotes the adjacency matrix of $k$-hop connectivities, $\mathrm{GNN}^{K}$ denotes a normal $K$-layer GNN defined recursively, such as a $K$-layer GCN or GAT, and the $\beta_k$ weights trade off the bilinear interactions at different hops. The time complexity of a $K$-layer BGNN is determined by the number of non-zero entries in $\mathbf{A}^{(K)}$. To reduce the actual complexity, one can follow the sampling strategy of GraphSAGE, sampling a portion of the high-hop neighbors rather than using all of them.
BGNN is a differentiable model, so it can be optimized end-to-end on any differentiable loss with gradient descent. In this work, we focus on the semi-supervised node classification task, optimizing BGNN with the cross-entropy loss on labeled nodes (the same setting as the GCN work, for a fair comparison). As the experimented datasets are not large, we implement the layer-wise graph convolution in its matrix form, leaving the batch implementation and neighbor sampling that can scale to large graphs as future work.
[Table 1: node classification accuracy (%) of the compared methods; surviving row: BGAT-T 79.6±0.6, 71.4±1.3, -, 79.8±0.3, 84.2±0.4, 74.0±0.3, -.]
4 Experiments
We conduct experiments on three benchmark citation datasets: Cora, Citeseer, and Pubmed. In these datasets, nodes and edges represent documents and citation relations between documents, respectively. Each node is represented by bag-of-words features extracted from the content of the document, and has a label with a one-hot encoding of the document category. We employ the same data split as previous works [15, 31, 22]: 20 labeled nodes per class are used for training, and 500 nodes and 1,000 nodes are used as the validation set and test set, respectively. Note that the training process can use the features of all nodes. For this data split, we report the average test accuracy over ten different random initializations. To save space, we refer readers to prior work for the detailed statistics of the three datasets.
We compare against strong baselines in two main categories: network embedding and GNN. We select three widely used network embedding approaches: graph regularization-based network embedding (SemiEmb), and skip-gram-based graph embeddings (DeepWalk and Planetoid). For GNNs, we select GCN, GAT, and the Graph Isomorphism Network (GIN).
We devise two BGNNs, which implement the $\mathrm{AGG}$ function as in GCN and GAT, respectively. For each BGNN, we compare two variants with different scopes of bilinear interactions: 1) BGCN-A and BGAT-A, which consider all nodes within the $k$-hop neighbourhood, including the target node, in the bilinear interaction; and 2) BGCN-T and BGAT-T, which consider only the interactions between the target node and the neighbor nodes within its $k$-hop neighbourhood.
We closely follow the GCN work to set the hyper-parameters of SemiEmb, DeepWalk, and Planetoid. We perform a grid search to select the optimal hyper-parameters of the remaining methods, including the dropout rate, the weight $\lambda$ of the $L_2$-norm regularization, the $\beta$ that trades off the aggregated information from multi-hop nodes, and the $\alpha$ that balances the linear and bilinear aggregators. The dropout rate, $\lambda$, $\alpha$, and $\beta$ are selected from pre-defined candidate sets on the validation set. All BGNN-based models are trained for 2,000 epochs with an early-stopping strategy based on both the convergence behavior and the accuracy on the validation set.
4.1 Performance Comparison
In all cases, the proposed BGNN models achieve the best performance, with average improvements over the baselines larger than 1.5%. The results validate the effectiveness of BGNN, which we attribute to incorporating the pairwise interactions between nodes in the local structure (i.e., the ego network of the target node) when performing graph convolution.
On average, BGAT (BGCN) outperforms vanilla GAT (GCN) by 1.5% (1.6%). These results further indicate the benefit of considering the interaction between neighbor nodes, which could augment the representation of a target node, facilitating its classification. Furthermore, the improvements of BGAT and BGCN in the 1-layer and 2-layer settings are close, which indicates that the interactions between both 1-hop neighbors and 2-hop neighbors are helpful for the representation of a target node.
BGNN models, which have different scopes of the bilinear interactions, achieve different performance across datasets. In most cases, BGAT-T (BGCN-T) achieves performance better than BGAT-A (BGCN-A), signifying the importance of interactions with the target node.
[Table 2: test accuracy (%) of GCN and BGCN-T under the fixed and random data splits; surviving rows: BGCN-T 77.9±1.1, 80.3±1.1, 71.6±1.1 and BGCN-T 79.4±0.1, 82.0±0.1, 71.9±0.0.]
As reported in prior work, the performance of GCN on random data splits is significantly worse than on the fixed data split. As such, following that work, we also test the methods on 10 random splits of the training set while keeping the validation and test sets unchanged. Table 2 shows the performance of BGCN-T and GCN over the random splits. To save space, we omit the results of BGCN-A and the BGAT-based models, which show similar trends. As can be seen, BGCN-T still outperforms GCN with a high significance level, which further validates the effectiveness of the proposed model. However, the performance of both BGCN-T and GCN suffers under the random data splits compared to the fixed data split. This result is consistent with previous work and is reasonable, since the hyper-parameters were tuned on the fixed data split.
4.2 Study of BGNN
Impacts of Bilinear Aggregator.
As the BA is at the core of BGNN, we first investigate its impact on performance by varying the value of $\alpha$. Note that a larger $\alpha$ means a larger contribution from the BA; BGNN downgrades to the vanilla GNN with only the linear aggregator when $\alpha = 0$, and depends fully on the BA when $\alpha = 1$. Figures 2(a) and 2(b) show the performance of BGCN-T with 1 layer and 2 layers on the Cora and Citeseer datasets, respectively. We omit the results of the other BGCN-based and BGAT-based models and the results on Pubmed to save space; they show similar trends. We have the following observations: 1) Under the two settings (1-layer and 2-layer), the performance of BGCN-T varies in a range from 67.5 to 82.1, suggesting that careful tuning of $\alpha$ is needed for our models to achieve the desired performance. 2) BGCN-T outperforms vanilla GCN in most cases. This again verifies that the BA is capable of capturing complex patterns of information propagation that are hard to reveal with the linear aggregator alone. 3) Surprisingly, the performance of BGCN-T with $\alpha = 1$ is much worse than when $\alpha$ is set to its optimal value. One possible reason is that the BA mainly serves as a complement to the linear aggregator and can hardly achieve comparable performance on its own.
Impacts of Multi-Hop Neighbors.
We also study the effect of $\beta$, in order to explore the trade-off between the aggregated information from different hops. Note that setting $\beta$ to 0 and 1 corresponds to modeling the one- and two-hop neighbors individually, respectively. As Figure 3 shows, involving the pairwise interactions of one- and two-hop neighbors simultaneously achieves better performance. This again verifies the effectiveness of stacking more BAs.
4.3 In-Depth Analysis of Aggregators
We perform an in-depth analysis of the different aggregators to clarify their working mechanisms with respect to two node characteristics: 1) Degree, the average number of (one- and two-hop) neighbors surrounding the target node; and 2) Ratio, for which we first count the number of (one- and two-hop) neighbors that have the same label as the target node, and then divide this number by the total number of one- and two-hop neighbors. We summarize the statistics in Table 3, wherein we mark whether the target nodes are correctly classified or misclassified by each model. That is, we categorize the testing nodes into three groups according to the correctness of the predictions from GCN and BGCN-T. Jointly analyzing the three categories corresponding to the three rows of Table 3, we have the following findings: 1) Focusing on the third category, which has the smallest degree, BGCN-T consistently outperforms GCN, suggesting that the bilinear aggregator is able to distill useful information from sparser neighborhoods. 2) Comparing the third category to the second one, we observe that BGCN-T endows the predictor with a node-denoising ability. That is, BGCN-T can effectively aggregate information from the neighbors with consistent labels, filtering out useless information from irrelevant neighbors. 3) We also recognize a limitation of BGCN-T from the second category: the bilinear interaction might need more label-consistent neighbors (i.e., a larger ratio) when more neighbors are involved (i.e., a larger degree).
5 Conclusion
In this paper, we proposed BGNN, a new graph neural network framework, which augments the expressiveness of vanilla GNN by considering the interactions between neighbor nodes. The neighbor interactions are captured by a simple but carefully devised bilinear aggregator. The simplicity of the bilinear aggregator allows BGNN to keep the same model complexity as vanilla GNN w.r.t. both the number of learnable parameters and the analytical time complexity. Furthermore, the bilinear aggregator is proved to be permutation invariant, an important property for GNN aggregators [10, 29]. We applied the proposed BGNN to the semi-supervised node classification task, achieving state-of-the-art performance on three benchmark datasets. In the future, we plan to explore the following research directions: 1) encoding high-order interactions among multiple neighbors, 2) exploring the effectiveness of deeper BGNNs with more than two layers, and 3) developing AutoML techniques to adaptively learn the optimal $\alpha$ and $\beta$ for each neighbor.
- (2016) Diffusion-convolutional neural networks. In NeurIPS, pp. 1993–2001.
- (2018) Latent cross: making use of context in recurrent recommender systems. In WSDM, pp. 46–54.
- (2017) Geometric deep learning: going beyond Euclidean data. IEEE Signal Processing Magazine 34 (4), pp. 18–42.
- (2014) Spectral networks and locally connected networks on graphs. In ICLR.
- (2019) Hyperbolic graph convolutional neural networks. In NeurIPS, pp. 4869–4880.
- (2018) FastGCN: fast learning with graph convolutional networks via importance sampling. In ICLR.
- (2016) Convolutional neural networks on graphs with fast localized spectral filtering. In NeurIPS, pp. 3844–3852.
- (2019) Temporal relational ranking for stock prediction. ACM Transactions on Information Systems (TOIS) 37 (2), pp. 1–30.
- (2015) Efficient and robust automated machine learning. In NeurIPS, pp. 2962–2970.
- (2017) Inductive representation learning on large graphs. In NeurIPS, pp. 1024–1034.
- (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778.
- (2017) Neural factorization machines for sparse predictive analytics. In SIGIR, pp. 355–364.
- (2020) LightGCN: simplifying and powering graph convolution network for recommendation. In SIGIR.
- (2019) Rethinking knowledge graph propagation for zero-shot learning. In CVPR, pp. 11487–11496.
- (2017) Semi-supervised classification with graph convolutional networks. In ICLR.
- (2018) Deeper insights into graph convolutional networks for semi-supervised learning. In AAAI.
- (2019) LanczosNet: multi-scale deep graph convolutional networks. In ICLR.
- (2019) Exploiting interaction links for node classification with deep graph neural networks. In IJCAI, pp. 3223–3230.
- (2014) DeepWalk: online learning of social representations. In KDD, pp. 701–710.
- (2010) Factorization machines. In ICDM, pp. 995–1000.
- (2008) Collective classification in network data. AI Magazine 29 (3), pp. 93–93.
- (2018) Graph attention networks. In ICLR.
- (2019) Deep graph infomax. In ICLR.
- (2019) Neural graph collaborative filtering. In SIGIR, pp. 165–174.
- (2012) Deep learning via semi-supervised embedding. In Neural Networks: Tricks of the Trade, pp. 639–655.
- (2019) Simplifying graph convolutional networks. In ICML, pp. 6861–6871.
- (2019) Capsule graph neural network. In ICLR.
- (2019) Graph wavelet neural network. In ICLR.
- (2019) How powerful are graph neural networks? In ICLR.
- (2018) Representation learning on graphs with jumping knowledge networks. In ICML, pp. 8676–8685.
- (2016) Revisiting semi-supervised learning with graph embeddings. In ICML, pp. 86–94.
- (2018) Deep learning on graphs: a survey. arXiv preprint arXiv:1812.04202.