1 Introduction
A graph neural network (GNN) is a kind of neural network that performs neural operations over graph structure to learn node representations. Owing to its ability to learn more comprehensive node representations than models that consider only node features [31] or only graph structure [19], GNN has become a promising solution for a wide range of applications in social science [6, 14] and recommendation [24, 13]. To date, most graph convolution operations in GNNs are implemented as a linear aggregation (i.e., a weighted sum) over the features of the target node's neighbors [15]. Although it improves the representation of the target node, such linear aggregation assumes that the neighbor nodes are independent of each other, ignoring the possible interactions between them. Under some circumstances, the interactions between neighbor nodes could be a strong signal that indicates the characteristics of the target node. Figure 1
(left) illustrates a toy example of a target node and its neighbors in a transaction graph, where edges denote money-transfer relations and nodes are described by a set of features such as age and income. The interaction between nodes 1 and 2, which indicates that both have high incomes, could be a strong signal for estimating the credit rating of the target node (the intuition is that a customer who has close business relations with rich friends is more likely to repay a loan). Explicitly modeling such interactions between neighbors highlights the common properties within the local structure, which could be rather helpful for the target node's representation. In Figure
1 (right), we show that the summation-based linear aggregator — a common choice in existing GNNs — fails to highlight the income feature. In contrast, by using a multiplication-based aggregator that captures node interactions, the signal latent in the shared high incomes is highlighted, and as an auxiliary effect, some less useful features are zeroed out. Nevertheless, it is non-trivial to encode such local node interactions in GNN. The difficulty mainly comes from two indispensable requirements of a feasible graph convolution operation: 1) permutation invariance [29], i.e., the output should remain the same when the order of neighbor nodes is changed, so as to ensure the stability of a GNN; and 2) linear complexity [15], i.e., the computational complexity should increase linearly with the number of neighbors, so as to make a GNN scalable on large graphs. To this end, we take inspiration from neural factorization machines [12] to devise a new bilinear aggregator, which explicitly expresses the interaction between every two nodes and aggregates all pairwise interactions to enhance the target node's representation.
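To make the contrast concrete, the following sketch compares the two aggregation styles on two hypothetical neighbor feature vectors (the feature names and numbers are illustrative assumptions, not values taken from Figure 1):

```python
import numpy as np

# Hypothetical neighbors with features [age, income, balance], scaled to [0, 1].
n1 = np.array([0.2, 0.9, 0.0])  # neighbor 1: young, high income
n2 = np.array([0.7, 0.8, 0.1])  # neighbor 2: older, also high income

linear = (n1 + n2) / 2   # summation-based aggregation (weighted sum)
bilinear = n1 * n2       # multiplication-based aggregation (element-wise product)

# Multiplication sharpens the contrast of the shared high-income feature ...
assert bilinear[1] / bilinear[0] > linear[1] / linear[0]
# ... and zeroes out features that any one neighbor lacks.
assert bilinear[2] == 0.0
```

The element-wise product amplifies features that are large for both neighbors and suppresses features that are near zero for either one, which is exactly the "common-property" signal the paper aims to capture.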
On this basis, we develop a new graph convolution operator that is equipped with both the traditional linear aggregator and the newly proposed bilinear aggregator, and is proved to be permutation invariant. We name the new model Bilinear Graph Neural Network (BGNN), which is expected to learn more comprehensive representations by considering local node interactions. We devise two BGNN models, named BGCN and BGAT, which are equipped with the linear aggregators of GCN and GAT, respectively. Taking semi-supervised node classification as an example task, we evaluate BGCN and BGAT on three benchmark datasets to validate their effectiveness. Specifically, BGCN and BGAT outperform GCN and GAT by 1.6% and 1.5%, respectively. More fine-grained analyses show that the improvements on sparsely connected nodes are more significant, demonstrating the strength of the bilinear aggregator in modeling node interactions. The main contributions of this paper are summarized as:


We propose BGNN, a simple yet effective GNN framework, which explicitly encodes the local node interactions to augment the conventional linear aggregator.

We prove that the proposed BGNN model is permutation invariant and has linear computational complexity, two properties that are of importance for GNN models.

We conduct extensive experiments on three public benchmarks of semi-supervised node classification, validating the effectiveness of the proposed BGNN models.
2 Related Work
GNN generalizes traditional convolutional neural networks from Euclidean space to the graph domain. According to the format of the convolution operations, existing GNN models can be divided into two categories: spatial GNN and spectral GNN [32]. We review the two kinds of models separately, and refer readers to [3] for the mathematical connection between them.
Spectral GNN.
Spectral GNN is defined as performing convolution operations in the Fourier domain with spectral node representations [4, 7, 15, 17, 28]. Bruna et al. [4]
define the convolution over the eigenvectors of the graph Laplacian, which are viewed as the Fourier basis. Considering the high computational cost of the eigendecomposition, research on spectral GNN has focused on approximating the decomposition with different mathematical techniques
[7, 15, 17, 28]. For instance, Defferrard et al. [7] introduce Chebyshev polynomials with orders up to $K$ to approximate the eigendecomposition. In [15], Kipf and Welling simplify this model by limiting $K = 1$ and approximating the largest eigenvalue of the Laplacian matrix by 2. In addition, Liao et al. [17] employ the Lanczos algorithm to perform a low-rank approximation of the graph Laplacian. Recently, the wavelet transform was introduced to spectral GNN to discard the eigendecomposition [28]. However, spectral GNN models are hard to apply to large graphs such as social networks. This is because the convolution operations must be performed over the whole graph, posing unaffordable memory costs and precluding the widely applied batch training.
Spatial GNN.
Spatial GNN instead performs convolution operations directly over the graph structure, aggregating features from spatially close neighbors to a target node [1, 10, 15, 22, 30, 27, 23, 29, 8]. This line of research is mainly focused on developing aggregation methods from different perspectives. For instance, Kipf and Welling [15] propose a linear aggregator (i.e., weighted sum) that uses the inverse of node degree as the coefficient. To improve the representation performance, a neural attention mechanism is introduced to learn the coefficients from data [22]. In addition to aggregating information from directly connected neighbors, augmented aggregators also account for multi-hop neighbors [1, 30]
. Moreover, non-linear aggregators are also employed in spatial GNNs, such as max pooling [10], capsule [27], and Long Short-Term Memory (LSTM) [10]. Furthermore, spatial GNN has been extended to graphs with both static and temporal neighbor structures [18] and to representations in hyperbolic space [5]. However, most existing aggregators (both linear and non-linear ones) forgo the importance of the interactions among neighbors. As they are built upon the summation operation, the linear aggregators by nature assume that neighbors are independent. Most of the non-linear ones focus on properties of the neighbors at the set level (i.e., all neighbors), e.g., the "skeleton" of the neighbors [29]. Taking one neighbor as the input of each timestep, an LSTM-based aggregator could capture sequential dependency, which might include node interactions. However, it requires a predefined order on neighbors, violating permutation invariance and typically showing weak performance [10]. Our work differs from those aggregators in that we explicitly consider pairwise node interactions in a neat and systematic way.
3 Bilinear Graph Neural Network
Preliminaries.
Let $G = (\mathbf{A}, \mathbf{X})$ be the graph of interest, where $\mathbf{A} \in \{0,1\}^{N \times N}$ is the binary adjacency matrix in which an element $A_{vi} = 1$ means that an edge exists between node $v$ and node $i$, and $\mathbf{X} \in \mathbb{R}^{N \times F}$ is the original feature matrix that describes each node with a vector of size $F$ (a row). We denote the neighbors of node $v$ as $\mathcal{N}_v = \{i \mid A_{vi} = 1\}$, which stores all nodes that have an edge with $v$, and denote the extended neighbors of node $v$ as $\tilde{\mathcal{N}}_v = \mathcal{N}_v \cup \{v\}$, which additionally contains the node itself. For convenience, we use $d_v$ to denote the degree of node $v$, i.e., $d_v = |\mathcal{N}_v|$, and accordingly $\tilde{d}_v = d_v + 1$. The model objective is to learn a representation vector $\mathbf{h}_v$ for each node $v$, such that its characteristics are properly encoded. For example, the label of node $v$ can be directly predicted as a function output $\hat{y}_v = f(\mathbf{h}_v)$, without the need to look into the graph structure and the original node features in $\mathbf{X}$. The spatial GNN [22] achieves this goal by recursively aggregating the features from neighbors:
$$\mathbf{h}_v^{(k)} = \mathrm{AGG}\big(\big\{\mathbf{W}^{(k)} \mathbf{h}_i^{(k-1)}\big\}_{i \in \tilde{\mathcal{N}}_v}\big), \tag{1}$$
where $\mathbf{h}_v^{(k)}$ denotes the representation of the target node $v$ at the $k$-th layer/iteration, $\mathbf{W}^{(k)}$ is the weight matrix (model parameter) that performs feature transformation at the $k$-th layer, and the initial representation $\mathbf{h}_v^{(0)}$ can be obtained from the original feature matrix $\mathbf{X}$.
The function $\mathrm{AGG}(\cdot)$ is typically implemented as a weighted sum with $a_{vi}$ as the weight of neighbor $i$. In GCN [15], $a_{vi}$ is defined as $1/\sqrt{\tilde{d}_v \tilde{d}_i}$, which is grounded in Laplacian theories. The recent advance of the graph attention network (GAT) [22] learns $a_{vi}$ from data, which has the potential to lead to better performance than predefined choices. However, a limitation of such a weighted sum is that no interactions between neighbor representations are modeled. Although using a more powerful feature transformation function such as a multi-layer perceptron (MLP) [29] can alleviate the problem, the process is rather implicit and ineffective. Empirical evidence comes from [2], which shows that MLP is inefficient in capturing the multiplication relations between input features. In this work, we propose to explicitly inject multiplication-based node interactions into the $\mathrm{AGG}(\cdot)$ function.
3.1 Bilinear Aggregator
As demonstrated in Figure 1, the multiplication between two vectors is an effective manner to model the interactions — emphasizing common properties and diluting discrepant information. Inspired by factorization machines (FMs) [20, 12]
that have been intensively used to learn the interactions among categorical variables, we propose a bilinear aggregator that is suitable for modeling the neighbor interactions in the local structure:
$$\mathrm{BA}\big(\{\mathbf{h}_i\}_{i \in \tilde{\mathcal{N}}_v}\big) = \frac{1}{b_v} \sum_{i \in \tilde{\mathcal{N}}_v} \sum_{j \in \tilde{\mathcal{N}}_v,\, j > i} (\mathbf{W} \mathbf{h}_i) \odot (\mathbf{W} \mathbf{h}_j), \tag{2}$$
where $\odot$ denotes the element-wise product; $v$ is the target node to obtain the representation for; $i$ and $j$ are node indices from the extended neighbors $\tilde{\mathcal{N}}_v$ — they are constrained to be different to avoid self-interactions, which are meaningless and may even introduce extra noise. $b_v = \frac{1}{2} \tilde{d}_v (\tilde{d}_v - 1)$ denotes the number of interactions for the target node $v$, which normalizes the obtained representation to remove the bias of node degree. It is worth noting that we take the target node itself into account and aggregate information from the extended neighbors which, although it looks the same as in GNN, is done for a different reason. In GNN, accounting for the target node retains its information during layer-wise aggregation, working like residual learning [11]. In BGNN, our consideration is that the interactions between the target node and its neighbors may also carry useful signal. For example, for sparse nodes that have only one neighbor, interactions between neighbors do not exist, and the interaction between the target and the neighbor node can be particularly helpful.
Time Complexity Analysis.
At first sight, the bilinear aggregator considers all pairwise interactions between neighbors (including the target node), and thus seems to have a quadratic time complexity $O(\tilde{d}_v^2)$ w.r.t. the neighbor count, higher than that of the weighted sum. Nevertheless, through a mathematical reformulation similar to the one used in FM, we can compute the aggregator in linear time — $O(\tilde{d}_v)$ — the same complexity as the weighted sum. To show this, we rewrite Equation (2) in its equivalent form as:
$$\mathrm{BA}\big(\{\mathbf{h}_i\}_{i \in \tilde{\mathcal{N}}_v}\big) = \frac{1}{2 b_v} \Big( \big( \sum_{i \in \tilde{\mathcal{N}}_v} \mathbf{s}_i \big)^{2} - \sum_{i \in \tilde{\mathcal{N}}_v} \mathbf{s}_i^{2} \Big), \tag{3}$$
where $\mathbf{s}_i = \mathbf{W} \mathbf{h}_i$ and the squares are element-wise. As can be seen, through mathematical reformulation, we reduce the sum over pairwise element-wise products to the difference of two terms, where each term is a weighted sum of neighbor representations (or their squares) and can be computed in $O(\tilde{d}_v)$ time. Note that multiplying by the weight matrix $\mathbf{W}$ is a standard operation in an aggregator, so its time cost is omitted for brevity.
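The reformulation can be checked numerically; the sketch below uses random vectors standing in for the transformed representations $\mathbf{s}_i = \mathbf{W}\mathbf{h}_i$:

```python
import numpy as np

rng = np.random.default_rng(0)
d, dim = 5, 8                      # 5 extended neighbors, 8-dim transformed features
S = rng.standard_normal((d, dim))  # row i plays the role of s_i = W h_i

# Naive O(d^2) form of Eq. (2): sum of element-wise products over pairs i < j
naive = np.zeros(dim)
for i in range(d):
    for j in range(i + 1, d):
        naive += S[i] * S[j]

# Reformulated O(d) form of Eq. (3), up to the shared 1/b_v normalization:
fast = (S.sum(axis=0) ** 2 - (S ** 2).sum(axis=0)) / 2

assert np.allclose(naive, fast)
```

This is the same square-of-sum minus sum-of-squares identity that gives factorization machines their linear-time evaluation.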
Proof of Permutation Invariance.
This property is intuitive to understand from the reduced Equation (3): when changing the order of the input vectors, the sum of the inputs (the first term) and the sum of the squares of the inputs (the second term) are unchanged. Thus the output is unchanged and the permutation invariance property is satisfied. To provide a rigorous proof, we give the matrix form of the bilinear aggregator, which also facilitates the matrix-wise implementation of BGNN. The matrix form of the bilinear aggregator is:
$$\mathrm{BA}(\mathbf{H}, \mathbf{A}) = \frac{1}{2} \mathbf{B}^{-1} \Big( (\tilde{\mathbf{A}} \mathbf{H} \mathbf{W})^{2} - \tilde{\mathbf{A}} (\mathbf{H} \mathbf{W})^{2} \Big), \tag{4}$$
where the squares are element-wise, $\mathbf{H}$ stores the representation vectors $\mathbf{h}_v$ for all nodes, $\tilde{\mathbf{A}} = \mathbf{A} + \mathbf{I}$ is the adjacency matrix of the graph with a self-loop added on each node ($\mathbf{I}$ is an identity matrix), and $\mathbf{B}$ is a diagonal matrix with each diagonal element $B_{vv} = b_v$. Let $\mathbf{P}$ be any permutation matrix, which satisfies (1) $\mathbf{P}^{\top} \mathbf{P} = \mathbf{I}$, and (2) for any matrix $\mathbf{M}$, if $\mathbf{P}\mathbf{M}$ exists, then $(\mathbf{P}\mathbf{M})^{2} = \mathbf{P}\mathbf{M}^{2}$, i.e., element-wise squaring commutes with row permutation. When we apply the permutation $\mathbf{P}$ to the nodes, $\mathbf{H}$ changes to $\mathbf{P}\mathbf{H}$, $\tilde{\mathbf{A}}$ changes to $\mathbf{P}\tilde{\mathbf{A}}\mathbf{P}^{\top}$ and $\mathbf{B}$ changes to $\mathbf{P}\mathbf{B}\mathbf{P}^{\top}$, which leads to:
$$\mathrm{BA}(\mathbf{P}\mathbf{H}, \mathbf{P}\mathbf{A}\mathbf{P}^{\top}) = \frac{1}{2} (\mathbf{P}\mathbf{B}\mathbf{P}^{\top})^{-1} \Big( (\mathbf{P}\tilde{\mathbf{A}}\mathbf{P}^{\top} \mathbf{P}\mathbf{H}\mathbf{W})^{2} - \mathbf{P}\tilde{\mathbf{A}}\mathbf{P}^{\top} (\mathbf{P}\mathbf{H}\mathbf{W})^{2} \Big) = \mathbf{P} \, \mathrm{BA}(\mathbf{H}, \mathbf{A}),$$
which indicates the permutation invariance property.
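The matrix form and its permutation property can be verified numerically. The sketch below assumes a small cycle graph so that every node has at least two extended neighbors (hence $b_v > 0$ and the normalization is well defined):

```python
import numpy as np

def bilinear_agg(H, A, W):
    """Matrix-form bilinear aggregator, Eq. (4):
    0.5 * B^{-1} ((A~ H W)^2 - A~ (H W)^2), squares taken element-wise."""
    A_t = A + np.eye(len(A))       # A~ = A + I (self-loops)
    S = H @ W
    d = A_t.sum(axis=1)            # extended degrees d~_v
    b = d * (d - 1) / 2            # b_v = number of pairs in the extended neighborhood
    return 0.5 * ((A_t @ S) ** 2 - A_t @ (S ** 2)) / b[:, None]

n, f_in, f_out = 6, 4, 3
A = np.zeros((n, n))
for i in range(n):                 # 6-node cycle: every node has degree 2
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1

rng = np.random.default_rng(1)
H = rng.standard_normal((n, f_in))
W = rng.standard_normal((f_in, f_out))
P = np.eye(n)[rng.permutation(n)]  # random permutation matrix

# Permuting the nodes permutes the output rows accordingly: BA(PH, PAP^T) = P BA(H, A)
assert np.allclose(bilinear_agg(P @ H, P @ A @ P.T, W), P @ bilinear_agg(H, A, W))
```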
3.2 BGNN Model
We now describe the proposed BGNN model. As the bilinear aggregator emphasizes node interactions and encodes a different signal from the weighted-sum aggregator, we combine them to build a more expressive graph convolutional network. We adopt a simple linear combination scheme, defining a new graph convolution operator as:
$$\mathbf{H}^{(1)} = (1 - \alpha) \cdot \mathrm{AGG}(\mathbf{H}^{(0)}, \mathbf{A}) + \alpha \cdot \mathrm{BA}(\mathbf{H}^{(0)}, \mathbf{A}), \tag{5}$$
where $\mathbf{H}^{(1)}$ stores the node representations at the first layer (encoding one-hop neighbors). $\alpha \in [0, 1]$ is a hyper-parameter to trade off the strengths of the traditional GNN aggregator and our proposed bilinear aggregator. Figure 2 illustrates the model framework.
Since both $\mathrm{AGG}(\cdot)$ and $\mathrm{BA}(\cdot)$ are permutation invariant, it is trivial to see that this graph convolution operator is also permutation invariant. When $\alpha$ is set to 0, no node interaction is considered and BGNN degrades to GNN; when $\alpha$ is set to 1, BGNN uses only the bilinear aggregator to process the information from the neighbors. Our empirical studies show that an intermediate value between 0 and 1 usually leads to better performance, verifying the efficacy of modeling node interactions, and that the optimal setting varies across datasets.
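One BGNN convolution of Eq. (5) can be sketched as follows, with a GCN-style symmetric normalization standing in for the linear aggregator (the normalization choice is an illustrative assumption; any permutation-invariant weighted sum would do):

```python
import numpy as np

def bgnn_layer(H, A, W, alpha):
    """(1 - alpha) * linear AGG + alpha * bilinear BA, as in Eq. (5)."""
    n = len(A)
    A_t = A + np.eye(n)                               # self-loops
    d = A_t.sum(axis=1)
    linear = (A_t / np.sqrt(np.outer(d, d))) @ H @ W  # GCN-style weighted sum
    S = H @ W
    b = d * (d - 1) / 2                               # pair counts b_v
    bilinear = 0.5 * ((A_t @ S) ** 2 - A_t @ (S ** 2)) / b[:, None]
    return (1 - alpha) * linear + alpha * bilinear

n = 5
A = np.zeros((n, n))
for i in range(n):                                    # cycle graph: all d~_v = 3
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1
rng = np.random.default_rng(2)
H, W = rng.standard_normal((n, 4)), rng.standard_normal((4, 3))

out = bgnn_layer(H, A, W, alpha=0.3)
assert out.shape == (n, 3)
# alpha = 0 recovers the plain linear aggregator (on a cycle, all d~_v = 3)
assert np.allclose(bgnn_layer(H, A, W, 0.0), (A + np.eye(n)) / 3 @ H @ W)
```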
Multi-layer BGNN.
Traditional GNN models [15, 22, 29] encode information from multi-hop neighbors in a recursive manner by stacking multiple aggregators. For example, a 2-layer GNN model is formalized as:
$$\mathbf{H}^{(2)} = \mathrm{AGG}\Big( \sigma\big( \mathrm{AGG}(\mathbf{H}^{(0)}, \mathbf{A}) \big), \mathbf{A} \Big), \tag{6}$$
where $\sigma(\cdot)$ is a non-linear activation function. Similarly, we can devise a 2-layer BGNN model in the same recursive manner:
$$\mathbf{H}^{(2)} = (1 - \alpha) \cdot \mathrm{AGG}\big( \sigma(\mathbf{H}^{(1)}), \mathbf{A} \big) + \alpha \cdot \mathrm{BA}\big( \sigma(\mathbf{H}^{(1)}), \mathbf{A} \big), \tag{7}$$
with $\mathbf{H}^{(1)}$ given by Equation (5). However, such a straightforward multi-layer extension involves unexpected higher-order interactions. In the two-layer case, the second-layer representation includes partial 4th-order interactions among the two-hop neighbors, which are hard to interpret and unreasonable. When extending BGNN to multiple layers, say $L$ layers, we still hope to capture pairwise interactions, but between neighbors within $L$ hops. To this end, instead of directly stacking the layers, we define the 2-layer BGNN model as:
$$\mathbf{H}^{(2)} = (1 - \alpha) \cdot \mathrm{GNN}^{(2)}(\mathbf{H}^{(0)}, \mathbf{A}) + \alpha \cdot \Big( (1 - \beta) \cdot \mathrm{BA}(\mathbf{H}^{(0)}, \mathbf{A}) + \beta \cdot \mathrm{BA}(\mathbf{H}^{(0)}, \mathbf{A}^{(2)}) \Big), \tag{8}$$
where $\mathbf{A}^{(2)} = \mathrm{sign}(\mathbf{A}\mathbf{A})$ stores the 2-hop connectivities of the graph, and $\mathrm{sign}(\cdot)$ is an entry-wise operation that transforms non-zero entries to 1. As such, a non-zero entry in $\mathbf{A}^{(2)}$ means node $v$ can reach node $i$ within two hops. $\beta \in [0, 1]$ is a hyper-parameter to trade off the strengths of bilinear interactions within 1-hop neighbors and 2-hop neighbors.
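The $\mathrm{sign}(\mathbf{A}\mathbf{A})$ construction of the 2-hop connectivity matrix can be sketched directly:

```python
import numpy as np

# A 4-node path graph: 0 - 1 - 2 - 3
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])

# A @ A counts length-2 walks; sign() flattens the counts to {0, 1}.
A2 = (A @ A > 0).astype(int)

assert A2[0, 2] == 1   # node 0 reaches node 2 in two hops (via node 1)
assert A2[0, 3] == 0   # node 3 is three hops away from node 0
assert A2[0, 0] == 1   # diagonal: a node returns to itself in two hops over any edge
```

Note that the diagonal of $\mathbf{A}\mathbf{A}$ equals the node degree (a back-and-forth walk over any incident edge), which $\mathrm{sign}(\cdot)$ flattens to 1.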
Following the same principle, we define the $L$-layer BGNN as:
$$\mathbf{H}^{(L)} = (1 - \alpha) \cdot \mathrm{GNN}^{(L)}(\mathbf{H}^{(0)}, \mathbf{A}) + \alpha \cdot \sum_{l=1}^{L} \beta_l \cdot \mathrm{BA}(\mathbf{H}^{(0)}, \mathbf{A}^{(l)}), \tag{9}$$
where $\mathbf{A}^{(l)} = \mathrm{sign}(\mathbf{A}^{l})$ denotes the adjacency matrix of $l$-hop connectivities, the coefficients $\beta_l$ sum to 1, and $\mathrm{GNN}^{(L)}$ denotes a normal $L$-layer GNN that can be defined recursively, such as an $L$-layer GCN or GAT. The time complexity of an $L$-layer BGNN is determined by the number of non-zero entries in $\mathbf{A}^{(L)}$. To reduce the actual complexity, one can follow the sampling strategy in GraphSAGE [10], sampling a portion of high-hop neighbors rather than using all of them.
Model Training.
BGNN is a differentiable model, so it can be optimized end-to-end on any differentiable loss with gradient descent. In this work, we focus on the semi-supervised node classification task, optimizing BGNN with the cross-entropy loss on labeled nodes (the same setting as the GCN work [15] for a fair comparison). As the experimented datasets are not large, we implement the layer-wise graph convolution in its matrix form, leaving the batch implementation and neighbor sampling that can scale to large graphs as future work.
4 Experiments
Table 1: Test accuracy on Pubmed, Cora, and Citeseer under the 1-layer and 2-layer settings (RI denotes relative improvement; entries lost in extraction are marked —).

Model | 1-layer: Pubmed / Cora / Citeseer | 2-layer: Pubmed / Cora / Citeseer
SemiEmb | — / — / — | — / — / —
DeepWalk | — / — / — | — / — / —
Planetoid | — / — / — | — / — / —
GCN | — / — / — | — / — / —
GAT | — / — / — | — / — / —
GIN | — / — / — | — / — / —
BGCN-A | — / — / — | — / — / —
BGCN-T | 78.0 ± 0.2 / — / — | — / — / —
BGAT-A | — / — / — | — / — / —
BGAT-T | 79.6 ± 0.6 / — / 71.4 ± 1.3 | 79.8 ± 0.3 / 84.2 ± 0.4 / 74.0 ± 0.3
Datasets.
Following previous works [21, 31, 22], we utilize three benchmark citation-network datasets — Pubmed, Cora and Citeseer [21]. In these datasets, nodes and edges represent documents and citation relations between documents, respectively. Each node is represented by bag-of-words features extracted from the content of the document, and has a label with one-hot encoding of the document category. We employ the same data split as previous works [15, 31, 22]: 20 labeled nodes per class are used for training, and 500 nodes and 1,000 nodes are used as the validation set and test set, respectively. Note that the training process can use all of the nodes' features. For this data split, we report the average test accuracy over ten different random initializations. To save space, we refer readers to [15] for the detailed statistics of the three datasets.
Compared Methods.
We compare against strong baselines of two main categories: network embedding and GNN. We select three widely used network embedding approaches: graph regularization-based network embedding (SemiEmb) [25] and skip-gram-based graph embeddings (DeepWalk [19] and Planetoid [31]). For GNNs, we select GCN [15], GAT [22] and the Graph Isomorphism Network (GIN) [29]. We devise two BGNNs, which implement the $\mathrm{AGG}(\cdot)$ function as in GCN and GAT, respectively. For each BGNN, we compare two variants with different scopes of the bilinear interactions: 1) BGCN-A and BGAT-A, which consider all nodes within the considered hop neighbourhood, including the target node, in the bilinear interaction; and 2) BGCN-T and BGAT-T, which consider only the interactions between the target node and the neighbor nodes within its hop neighbourhood.
Parameter Settings.
We closely follow the GCN work [15] to set the hyper-parameters of SemiEmb, DeepWalk, and Planetoid. We perform grid search to select the optimal values of the hyper-parameters of the remaining methods, including the dropout ratio, the weight $\lambda$ of the $L_2$ norm, the $\beta$ that trades off the aggregated information from multi-hop nodes, and the $\alpha$ that balances the linear and bilinear aggregators. All BGNN-based models are trained for 2,000 epochs with an early-stopping strategy based on both the convergence behavior and the accuracy on the validation set.
4.1 Performance Comparison
Table 1 shows the performance of the compared methods on the three datasets w.r.t. prediction accuracy, using exactly the same data split as in [15]. From the table, we have the following observations:


In all cases, the proposed BGNN models achieve the best performance, with average improvements over the baselines larger than 1.5%. The results validate the effectiveness of BGNN, which is attributed to incorporating the pairwise interactions between the nodes in the local structure (i.e., the ego network of the target node) when performing graph convolution.

On average, BGAT (BGCN) outperforms vanilla GAT (GCN) by 1.5% (1.6%). These results further indicate the benefit of considering the interactions between neighbor nodes, which can augment the representation of a target node and facilitate its classification. Furthermore, the improvements of BGAT and BGCN in the 1-layer and 2-layer settings are close, which indicates that the interactions between both 1-hop and 2-hop neighbors are helpful for the representation of a target node.

The BGNN models, which have different scopes of the bilinear interactions, achieve different performance across datasets. In most cases, BGAT-T (BGCN-T) performs better than BGAT-A (BGCN-A), signifying the importance of interactions with the target node.
Table 2: Test accuracy of GCN and BGCN-T under random and fixed data splits (GCN entries lost in extraction are marked —).

Split | Model | Pubmed | Cora | Citeseer
Random | GCN | — | — | —
Random | BGCN-T | 77.9 ± 1.1 | 80.3 ± 1.1 | 71.6 ± 1.1
Fixed | GCN | — | — | —
Fixed | BGCN-T | 79.4 ± 0.1 | 82.0 ± 0.1 | 71.9 ± 0.0
As reported in [15], the performance of GCN on random data splits is significantly worse than on the fixed data split. As such, following [26], we also test the methods on 10 random splits of the training set while keeping the validation and test sets unchanged. Table 2 shows the performance of BGCN-T and GCN over the random splits. To save space, we omit the results of BGCN-A and the BGAT-based models, which show similar trends. As can be seen, BGCN-T still outperforms GCN with a high significance level, which further validates the effectiveness of the proposed model. However, the performance of both BGCN-T and GCN suffers under random data splits as compared to the fixed data split. This result is consistent with previous work [26] and is reasonable since the hyper-parameters were tuned on the fixed data split.
4.2 Study of BGNN
Impacts of Bilinear Aggregator.
As the BA is at the core of BGNN, we first investigate its impact on performance by varying the value of $\alpha$. Note that a larger $\alpha$ means more contribution from the BA; BGNN downgrades to the vanilla GNN with only the linear aggregator when $\alpha = 0$, and fully depends on the BA when $\alpha = 1$. Figures 2(a) and 2(b) show the performance of BGCN-T with 1 layer and 2 layers on the Cora and Citeseer datasets, respectively. To save space, we omit the results of the other BGCN-based and BGAT-based models and the results on Pubmed, which show similar trends. We have the following observations: 1) Under the two settings (1-layer and 2-layer), the performance of BGCN-T varies in a range from 67.5 to 82.1, suggesting that a careful tuning of $\alpha$ is needed for our models to achieve the desired performance. 2) BGCN-T outperforms vanilla GCN in most cases. This again verifies that the BA is capable of capturing complex patterns of information propagation that are hard to reveal by the linear aggregator alone. 3) Surprisingly, the performance of BGCN-T with $\alpha = 1$ is much worse than when $\alpha$ is set to the optimal value. One possible reason is that the BA mainly serves as a complement to the linear aggregator and can hardly achieve comparable performance working alone.
Impacts of Multi-Hop Neighbors.
We also study the effects of $\beta$, in order to explore the trade-off between the aggregated information from different hops. Note that setting $\beta$ to 0 and 1 denotes the individual modeling of one-hop and two-hop neighbors, respectively. As Figure 3 shows, involving the pairwise interactions from one-hop and two-hop neighbors simultaneously achieves better performance. This again verifies the effectiveness of stacking more BAs.
4.3 In-Depth Analysis of Aggregators
We perform an in-depth analysis of the different aggregators to clarify their working mechanisms with respect to two node characteristics: 1) Degree, the average number of (one- and two-hop) neighbors surrounding the target node, and 2) Ratio, obtained by first counting the number of (one- and two-hop) neighbors that have the same label as the target node, and then dividing this number by the number of all one- and two-hop neighbors. We summarize the statistical results in Table 3, wherein check and cross symbols denote whether the target nodes are correctly classified or misclassified, respectively. That is, we categorize the testing nodes into three groups according to the correctness of the predictions from GCN and BGCN-T. Jointly analyzing the three categories corresponding to the three rows in Table 3, we have the following findings: 1) Focusing on the third category, with the smallest degree, BGCN-T consistently outperforms GCN, suggesting that the bilinear aggregator is able to distill useful information from sparser neighborhoods. 2) Comparing the third category with the second one, we observe that BGCN-T is able to endow the predictor with node-denoising ability. That is, BGCN-T can effectively aggregate information from the neighbors with consistent labels, filtering out useless information from irrelevant neighbors. 3) We also recognize a limitation of BGCN-T from the second category — the bilinear interaction might need more label-consistent neighbors (i.e., a larger Ratio) when more neighbors are involved (i.e., a larger Degree).
Table 3: Average Degree and Ratio on Pubmed, Cora, and Citeseer for test nodes grouped by the prediction correctness of GCN and BGCN-T (numeric entries lost in extraction).
5 Conclusion
In this paper, we proposed BGNN, a new graph neural network framework, which augments the expressiveness of vanilla GNN by considering the interactions between neighbor nodes. The neighbor interactions are captured by a simple but carefully devised bilinear aggregator. The simplicity of the bilinear aggregator gives BGNN the same model complexity as vanilla GNN in terms of the number of learnable parameters and analytical time complexity. Furthermore, the bilinear aggregator is proved to be permutation invariant, an important property for GNN aggregators [10, 29]. We applied the proposed BGNN to the semi-supervised node classification task, achieving state-of-the-art performance on three benchmark datasets. In the future, we plan to explore the following research directions: 1) encoding higher-order interactions among multiple neighbors, 2) exploring the effectiveness of deeper BGNNs with more than two layers, and 3) developing AutoML techniques [9] to adaptively learn the optimal $\alpha$ and $\beta$.
References
[1] (2016) Diffusion-convolutional neural networks. In NeurIPS, pp. 1993–2001.
[2] (2018) Latent cross: making use of context in recurrent recommender systems. In WSDM, pp. 46–54.
[3] (2017) Geometric deep learning: going beyond Euclidean data. IEEE Signal Processing Magazine 34 (4), pp. 18–42.
[4] (2014) Spectral networks and locally connected networks on graphs. In ICLR.
[5] (2019) Hyperbolic graph convolutional neural networks. In NeurIPS, pp. 4869–4880.
[6] (2018) FastGCN: fast learning with graph convolutional networks via importance sampling. In ICLR.
[7] (2016) Convolutional neural networks on graphs with fast localized spectral filtering. In NeurIPS, pp. 3844–3852.
[8] (2019) Temporal relational ranking for stock prediction. ACM Transactions on Information Systems 37 (2), pp. 1–30.
[9] (2015) Efficient and robust automated machine learning. In NeurIPS, pp. 2962–2970.
[10] (2017) Inductive representation learning on large graphs. In NeurIPS, pp. 1024–1034.
[11] (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778.
[12] (2017) Neural factorization machines for sparse predictive analytics. In SIGIR, pp. 355–364.
[13] (2020) LightGCN: simplifying and powering graph convolution network for recommendation. In SIGIR.
[14] (2019) Rethinking knowledge graph propagation for zero-shot learning. In CVPR, pp. 11487–11496.
[15] (2017) Semi-supervised classification with graph convolutional networks. In ICLR.
[16] (2018) Deeper insights into graph convolutional networks for semi-supervised learning. In AAAI.
[17] (2019) LanczosNet: multi-scale deep graph convolutional networks. In ICLR.
[18] (2019) Exploiting interaction links for node classification with deep graph neural networks. In IJCAI, pp. 3223–3230.
[19] (2014) DeepWalk: online learning of social representations. In KDD, pp. 701–710.
[20] (2010) Factorization machines. In ICDM, pp. 995–1000.
[21] (2008) Collective classification in network data. AI Magazine 29 (3), pp. 93.
[22] (2018) Graph attention networks. In ICLR.
[23] (2019) Deep graph infomax. In ICLR.
[24] (2019) Neural graph collaborative filtering. In SIGIR, pp. 165–174.
[25] (2012) Deep learning via semi-supervised embedding. In Neural Networks: Tricks of the Trade, pp. 639–655.
[26] (2019) Simplifying graph convolutional networks. In ICML, pp. 6861–6871.
[27] (2019) Capsule graph neural network. In ICLR.
[28] (2019) Graph wavelet neural network. In ICLR.
[29] (2019) How powerful are graph neural networks? In ICLR.
[30] (2018) Representation learning on graphs with jumping knowledge networks. In ICML, pp. 8676–8685.
[31] (2016) Revisiting semi-supervised learning with graph embeddings. In ICML, pp. 86–94.
[32] (2018) Deep learning on graphs: a survey. arXiv preprint arXiv:1812.04202.