1 Introduction
Graph representation learning aims to transform nodes of a graph into low-dimensional dense vectors whilst preserving both the attribute features of nodes and the structure features of graphs. These node embeddings can then be fed into downstream machine learning algorithms to facilitate graph analytical tasks, such as node classification
[15, 24], link prediction [29], and community detection [4]. In recent years, there has been a surge of research interest in utilizing neural networks to handle graph-structured data. Among them, graph convolutional networks (GCNs) have been shown to be effective in graph representation learning: they can model complex attribute features and structure features of graphs and achieve state-of-the-art performance on various tasks. The core of graph convolution is that nodes learn their representations by aggregating features from their neighbors, i.e. the “neighborhood aggregation” scheme. Recently, several graph convolutional models, which primarily differ in their neighborhood aggregation strategies, have been proposed
[15, 24, 30, 27]. For example, GCN [15] can be seen as an approximation of aggregation over the first-order neighbors; GraphSAGE [10] designs several aggregators for inductive learning, where unlabeled data does not appear in the training process; GAT [24] introduces the attention mechanism to model the influence of neighbors with learnable parameters. From a historical perspective, machine learning research has gone through a long process of development, with one clear trend from simple, linear models to complex, nonlinear models. For example, limitations of the linear support vector machine (SVM) motivated the development of nonlinear and more expressive kernel-based SVM classifiers
[2]. Besides, similar trends can be observed in the realm of image processing, as real-world data distributions are usually rather complex. For example, simple linear image filters [11] have gradually been superseded by nonlinear convolutional neural networks (CNNs)
[16]. Driven by the significance of modeling complex and nonlinear distributions of data, a question arises: are existing GCNs capable enough to model the complex and nonlinear distributions of graphs? We find that most previous graph convolutional models (e.g., GCN and GAT) are usually shallow, with only one or two nonlinear activation function layers, which may restrict the models from capturing the complicated nonlinearity of graph data well.
In this paper, we first theoretically prove that the effect of nonlinear activation functions in GCNs is to introduce interaction terms of neighborhood features. We then show that the coefficients of these neighborhood interaction terms are relatively small in current GCN-based models. Motivated by this observation, we present a general framework named GraphAIR (Aggregation and InteRaction). The key idea behind our approach is to explicitly model the neighborhood interaction in addition to the neighborhood aggregation, which can better capture the complex and nonlinear node features. As illustrated in Figure 1, GraphAIR consists of two parts, i.e. aggregation and interaction. The aggregation module constructs node representations by combining features from neighborhoods; the interaction module explicitly models neighborhood interactions through multiplication.
Nevertheless, several challenges exist in modeling the neighborhood interaction. Firstly, different nodes may have various numbers of adjacent neighbors, leading to different numbers of interaction pairs among neighbors; defining a universal neighborhood interaction operator which is able to handle arbitrary numbers of interaction pairs is therefore challenging. Secondly, it is preferable to propose a general plug-and-play interaction module instead of designing model-specific neighborhood interaction strategies for different GCN-based models.
To tackle the aforementioned challenges, we derive that the neighborhood interaction can be easily obtained through the multiplication of node embeddings. As a result, both the neighborhood aggregation module and the neighborhood interaction module can be implemented by most existing graph convolutional layers.
In a nutshell, the main contributions of this paper are threefold. Firstly, to the best of our knowledge, this is the first work to explicitly model the neighborhood interaction for capturing the nonlinearity of graph-structured data. Secondly, the proposed GraphAIR can easily integrate off-the-shelf graph convolutional models, which shows favorable generality. Thirdly, extensive experiments conducted on benchmark tasks of node classification and link prediction show that GraphAIR achieves state-of-the-art performance.
2 Background and Preliminaries
In this section, we first introduce the notation used throughout the paper and then summarize some of the most common GCN models. Finally, we briefly introduce residual learning, which we employ in our model.
2.1 Notations
Let $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ be an undirected graph with $n$ nodes, where $A \in \mathbb{R}^{n \times n}$ is the adjacency matrix, $X \in \mathbb{R}^{n \times d}$ is the node attribute matrix, and $\boldsymbol{x}_i$ denotes the attribute vector of node $v_i$. Please note that in this paper we primarily focus on undirected graphs, but our proposed method can be easily generalized to weighted or directed graphs.
2.2 Aggregators in Graph Convolutional Models
As mentioned above, existing GCNs mainly differ in their neighborhood aggregation functions. Representative graph convolutional models such as GCN [15] and GAT [24] can be formulated as:
(1) $\boldsymbol{z}_i^{(l+1)} = \sum_{v_j \in \mathcal{N}(v_i)} \alpha_{ij} W^{(l)} \boldsymbol{h}_j^{(l)},$

(2) $\boldsymbol{h}_i^{(l+1)} = \sigma\big(\boldsymbol{z}_i^{(l+1)}\big),$
where $\boldsymbol{h}_i^{(l)}$ is the embedding of the $i$-th node resulting from the $l$-th graph convolutional layer, $W^{(l)}$ is a learnable weight matrix, $\alpha_{ij}$ is a scalar which indicates the importance of node $v_j$'s features to node $v_i$, and $\boldsymbol{h}_i^{(0)} = \boldsymbol{x}_i$. $\sigma(\cdot)$ is the activation function, e.g., $\mathrm{ReLU}(x) = \max(0, x)$, and $\mathcal{N}(v_i)$ is the set containing the first-order neighbors of node $v_i$ as well as node $v_i$ itself. To obtain the node embedding, a linear transformation is first conducted to project features into a new feature subspace. Then, the node embedding is updated by a weighted summation over the projected features of its neighbors, followed by a nonlinear activation function.
Different models adopt different strategies to design the aggregators. GCN uses a predefined weight matrix $\hat{A} = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$ for summarization, where $\tilde{A} = A + I$ is the adjacency matrix with self-loops and $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$. Here, entry $\hat{a}_{ij}$ of $\hat{A}$ is a predefined weight factor for the weighted summarization over neighborhoods, i.e. $\alpha_{ij} = \hat{a}_{ij}$ in Eq. (1). Unlike GCN, GAT makes use of the attention mechanism to explicitly learn $\alpha_{ij}$ as follows:
(3) $\alpha_{ij} = \dfrac{\exp\big(a(W\boldsymbol{h}_i, W\boldsymbol{h}_j)\big)}{\sum_{v_k \in \mathcal{N}(v_i)} \exp\big(a(W\boldsymbol{h}_i, W\boldsymbol{h}_k)\big)},$

where $a(\cdot, \cdot)$ is a self-attention function, which can be simply implemented as a feed-forward neural network.
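To make the aggregation scheme concrete, the following minimal NumPy sketch implements Eqs. (1) and (2) with the predefined GCN weights $\hat{A}$; the helper names (`normalized_adjacency`, `gcn_layer`) are ours for illustration, not from any reference implementation.

```python
import numpy as np

def normalized_adjacency(A):
    """Symmetrically normalized adjacency with self-loops,
    A_hat = D~^{-1/2} (A + I) D~^{-1/2}, as used by GCN."""
    A_tilde = A + np.eye(A.shape[0])
    d = A_tilde.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_tilde @ D_inv_sqrt

def gcn_layer(A_hat, H, W):
    """One aggregation step: sigma(A_hat H W), with ReLU as sigma.
    Row i of A_hat @ H @ W is the weighted sum over N(v_i)."""
    return np.maximum(A_hat @ H @ W, 0.0)

# toy graph: three nodes on a path, one-hot features
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
X = np.eye(3)
W = 0.1 * np.ones((3, 2))
H1 = gcn_layer(normalized_adjacency(A), X, W)
```

Here $\alpha_{ij}$ is fixed to $\hat{a}_{ij}$; a GAT-style layer would instead compute the $\alpha_{ij}$ with the attention function of Eq. (3).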
The implicit and insufficient neighborhood interaction involved in existing GCNs. It is seen from Eq. (2) that without the activation function, the node representation would depend linearly on the neighborhood features. Although mainstream models adopt nonlinear activation functions, which are able to introduce the neighborhood interaction implicitly as a side effect, they still face challenges in learning the neighborhood interaction sufficiently. We take the sigmoid function $\sigma(x) = 1/(1 + e^{-x})$ as an example and approximate it with Taylor polynomials. Note that mainstream GCN-based models use piecewise non-saturating activation functions, such as ReLU and LeakyReLU; these functions suppress negative values yet are still linear for positive values. Here we analyze the sigmoid function as it brings more nonlinearity. Since the elements in the node embeddings are small¹, the high-order interaction terms among the neighborhoods are small as well. We therefore analyze the coefficients of the high-order interaction terms, as stated in the following proposition.

¹ Most existing graph convolutional models, including GCN, GraphSAGE, and GAT, normalize the input and initialize the weights using Glorot initialization [9].

Proposition 1.
When applying the sigmoid function to the result of the linear combination formulated in Eq. (1), the equivalent coefficient of the high-order interaction terms of the neighborhood embeddings is at most $\frac{1}{48}$.
Proof.
The sigmoid function can be approximated by its Taylor polynomial at $x_0 = 0$:

(4) $\sigma(x) = \sum_{k=0}^{n} \frac{\sigma^{(k)}(0)}{k!}\, x^k + R_n(x) = \frac{1}{2} + \frac{x}{4} + \cdots + R_n(x),$
where $n$ is the degree of the polynomial. The approximation error can be bounded using the Lagrange form of the remainder:

(5) $R_n(x) = \frac{\sigma^{(n+1)}(\xi)}{(n+1)!}\, x^{n+1}, \quad \text{for some } \xi \text{ between } 0 \text{ and } x,$
Since the coefficient of the quadratic term is zero, we set $n = 2$ and analyze the contribution of the high-order interaction terms. Then, replacing $x$ with $\boldsymbol{z}_i^{(l+1)}$, Eq. (2) can be written as follows:

(6) $\boldsymbol{h}_i^{(l+1)} = \frac{1}{2} + \frac{1}{4} \sum_{v_j \in \mathcal{N}(v_i)} \alpha_{ij} W^{(l)} \boldsymbol{h}_j^{(l)} + R_2\big(\boldsymbol{z}_i^{(l+1)}\big),$
where $R_2(\cdot)$ is the remainder containing the high-order interaction terms, whose equivalent coefficient $\sigma'''(\xi)/3!$ has absolute value at most $\frac{1}{48}$, which concludes the proof. The detailed proof is given in Appendix A of the supplementary material. ∎
Remark.
Proposition 1 states that the effect of nonlinear activation functions in GCNs is to introduce the interaction terms of neighborhood features. Importantly, the coefficients of the neighborhood interaction terms in current GCN-based models are relatively small, leading to a negligible contribution to node representations. As existing GCNs are usually shallow, with only one or two nonlinear layers, to avoid over-smoothing and overfitting [18], the nonlinearity of graph data cannot be learned sufficiently.
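The bound in Proposition 1 can also be checked numerically. The short script below (ours, for illustration only) evaluates the third derivative of the sigmoid on a dense grid and confirms that the Lagrange remainder coefficient $|\sigma'''(\xi)|/3!$ never exceeds $1/48$.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_d3(x):
    # third derivative, expressed through s = sigmoid(x):
    # sigma''' = s (1 - s) (1 - 6 s + 6 s^2)
    s = sigmoid(x)
    return s * (1 - s) * (1 - 6 * s + 6 * s ** 2)

xs = np.linspace(-10.0, 10.0, 200001)
max_abs_d3 = np.abs(sigmoid_d3(xs)).max()   # attained at x = 0, value 1/8
coef_bound = max_abs_d3 / 6.0               # divide by 3! = 6, giving 1/48
```
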
2.3 Residual Learning
In this paper, we employ residual learning to combine the neighborhood aggregation and interaction. Residual learning [12]
is a widely-used building block for deep learning. Suppose $\mathcal{H}(\boldsymbol{x})$ is the true and desired mapping and $\boldsymbol{x}$ is the suboptimal representation which serves as the input feature to the residual module. Residual learning can be formulated as:

(7) $\mathcal{H}(\boldsymbol{x}) = \mathcal{F}(\boldsymbol{x}) + \boldsymbol{x},$
where $\mathcal{F}(\cdot)$ is a residual function. Practically, we can apply a few nonlinear layers to obtain the suboptimal representation $\boldsymbol{x}$ and some other nonlinear layers to implement the residual function $\mathcal{F}(\cdot)$. The essence of residual learning lies in the skip connection, through which the earlier representations are able to flow to later layers. The skip connection enables more direct reuse of the suboptimal representation and improves the information flow during forward and backward propagation [12], which makes the network easier to optimize. Many approaches [12, 17] have shown that residual learning helps break away from local optima and improves performance.
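In code, a residual block in the sense of Eq. (7) is just a skip connection around a learned residual function; the sketch below (illustrative names, with two dense layers as $\mathcal{F}$) makes this explicit.

```python
import numpy as np

def residual_block(x, W1, W2):
    """H(x) = F(x) + x, where F is a small two-layer network.
    If F outputs zero, the block reduces to the identity mapping."""
    relu = lambda v: np.maximum(v, 0.0)
    F = relu(x @ W1) @ W2   # residual function F(x)
    return x + F            # skip connection
```
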
3 The Proposed Method: GraphAIR
In this section, we first formulate the model of the neighborhood interaction and then describe how the parameters of the GraphAIR model can be learned. Finally, we summarize the overall model architecture and analyze the computational complexity.
3.1 Modeling the Neighborhood Interaction with Residual Functions
As discussed in Section 2.2, node representations resulting from the neighborhood aggregation scheme are unlikely to capture the complicated nonlinearity of graphs well, because they learn the neighborhood interaction implicitly and insufficiently. In this section, we describe the embedding generation algorithm of GraphAIR, which aims to incorporate the neighborhood interaction into node representations. To begin with, a natural idea to model the quadratic terms of the neighborhood interaction is formulated as:
(8) $\boldsymbol{p}_i = \sum_{v_j \in \mathcal{N}(v_i)} \sum_{v_k \in \mathcal{N}(v_i)} w_{jk} \big(W \boldsymbol{h}_j\big) \odot \big(W \boldsymbol{h}_k\big),$

where $\boldsymbol{p}_i$ is the neighborhood interaction representation of node $v_i$, $w_{jk}$ denotes the coefficient of the quadratic term, and $\odot$ is the element-wise multiplication operator. However, it is infeasible to learn $w_{jk}$ directly in our case: for each node $v_i$, there are $|\mathcal{N}(v_i)|^2$ coefficients to estimate, which exposes the risk of overfitting. To alleviate this problem, we simply assign $w_{jk}$ as the product of the importance weights $\alpha_{ij}$ and $\alpha_{ik}$. The simplification is reasonable in the following respects. For node $v_i$, if $\alpha_{ij}$ and $\alpha_{ik}$ are large, then the neighbor nodes $v_j$ and $v_k$ should be considered important factors for the representation of node $v_i$. Compared to other interaction terms, the interaction between nodes $v_j$ and $v_k$ is likely to provide more relevant information about node $v_i$; consequently, $w_{jk}$ should be large. In contrast, if $\alpha_{ij}$ and $\alpha_{ik}$ are small, neighbor nodes $v_j$ and $v_k$ may have only a slight impact on node $v_i$, and thus the interaction coefficient $w_{jk}$ should be small as well. Formally, we arrive at:

(9) $\boldsymbol{p}_i = \sum_{v_j \in \mathcal{N}(v_i)} \sum_{v_k \in \mathcal{N}(v_i)} \alpha_{ij} \alpha_{ik} \big(W \boldsymbol{h}_j\big) \odot \big(W \boldsymbol{h}_k\big) = \boldsymbol{a}_i \odot \boldsymbol{a}_i,$
where $\boldsymbol{a}_i = \sum_{v_j \in \mathcal{N}(v_i)} \alpha_{ij} W \boldsymbol{h}_j$ denotes the representation resulting from neighborhood aggregation.
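The factorization in Eq. (9) can be verified directly: the weighted sum over all pairwise element-wise products in a neighborhood equals the element-wise square of the aggregated representation. A small numerical check (toy weights and features, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = np.array([0.5, 0.3, 0.2])   # toy attention weights over 3 neighbors
H = rng.normal(size=(3, 4))         # toy projected neighbor features W h_j

# brute force: sum over all ordered pairs (j, k)
pairwise = sum(alpha[j] * alpha[k] * H[j] * H[k]
               for j in range(3) for k in range(3))

# factorized form of Eq. (9): aggregate once, then square element-wise
a = (alpha[:, None] * H).sum(axis=0)
p = a * a
```

This is why the interaction costs no more than one extra aggregation: the quadratic number of pairs collapses into a single product.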
In order to introduce more nonlinearity into our model, we apply a nonlinear activation function to the two representations resulting from neighborhood aggregation and neighborhood interaction, respectively. Besides, to combine these two representations, we add them via a skip connection:
(10) $\boldsymbol{h}_i = \sigma(\boldsymbol{a}_i) + \sigma(\boldsymbol{p}_i),$
However, although we adopt a skip connection here, we argue that we still cannot benefit from residual learning, where the suboptimal representation and the residual function are supposed to be implemented by different nonlinear layers. As formulated in Eqs. (9) and (10), the two representations resulting from neighborhood aggregation and interaction are based on the same weight matrix $W$, which means that the variations of the two representations during backpropagation are highly correlated. According to Bengio et al. (2013), it is important to disentangle the factors of variation in the representations, as only a few factors tend to change at a time. Therefore, to make use of residual learning, which eases optimization, we introduce another weight matrix $W'$ to disentangle learning the neighborhood interaction from the neighborhood aggregation. Formally, instead of Eq. (9), we use the following equation to learn the neighborhood interaction in our model:
(11) $\boldsymbol{p}_i = \Big(\sum_{v_j \in \mathcal{N}(v_i)} \alpha_{ij} W \boldsymbol{h}_j\Big) \odot \Big(\sum_{v_j \in \mathcal{N}(v_i)} \alpha_{ij} W' \boldsymbol{h}_j\Big) = \boldsymbol{a}_i \odot \boldsymbol{b}_i,$
where the first term $\boldsymbol{a}_i$ denotes the representation resulting from neighborhood aggregation and the second term $\boldsymbol{b}_i$ provides the other half of the node representation for the multiplication in the interaction process. $\boldsymbol{a}_i$ is the input representation to the residual module and $W'$ is the learnable weight of the residual function. Note that both $\boldsymbol{a}_i$ and $\boldsymbol{b}_i$ can be implemented by existing graph convolutional layers. Thus the proposed GraphAIR framework is compatible with most existing GCN-based models, and it provides a plug-and-play module for the neighborhood interaction.
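Putting Eqs. (10) and (11) together, one GraphAIR block on top of a GCN-style aggregator can be sketched as below; the function and variable names are ours, and any aggregator producing $\boldsymbol{a}_i$ and $\boldsymbol{b}_i$ could be substituted.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def graphair_block(A_hat, H, W, W_res):
    """Aggregation branch a, residual branch b with disentangled weights,
    interaction p = a * b, combined through a skip connection."""
    a = A_hat @ H @ W         # neighborhood aggregation (Eq. (1), pre-activation)
    b = A_hat @ H @ W_res     # residual branch of Eq. (11)
    p = a * b                 # neighborhood interaction, Eq. (11)
    return relu(a) + relu(p)  # skip connection, Eq. (10)
```

With $W' = 0$ the interaction vanishes and the block degenerates to a plain aggregation layer, which is the sense in which the interaction acts as a residual.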
3.2 Learning the Parameters of GraphAIR
In this section, we introduce how to learn the parameters under the GraphAIR framework. As we aim to propose a general approach for graph representation learning, we can apply different kinds of graph-based loss functions, such as the proximity ranking loss in link prediction tasks and the cross-entropy loss in node classification tasks. Without loss of generality, we take the task of node classification as an example.
To compute the probability that each node belongs to a certain class, existing GCN-based models usually employ one additional graph convolutional layer with a softmax classifier for prediction. The output representation $\hat{\boldsymbol{y}}_i$ is formulated as:

(12) $\hat{\boldsymbol{y}}_i = f(\boldsymbol{h}_i) = \mathrm{softmax}\Big(\sum_{v_j \in \mathcal{N}(v_i)} \alpha_{ij} W_c \boldsymbol{h}_j\Big),$

where $f(\cdot)$ is the prediction function, $W_c$ is a learnable weight matrix, and the output dimension of $W_c$ equals the number of classes $c$. Then, the loss of node classification can be calculated as $\mathcal{L} = \sum_{v_i} \ell(\hat{\boldsymbol{y}}_i, \boldsymbol{y}_i)$, where $\boldsymbol{y}_i$ is the true label of node $v_i$ and $\ell(\cdot, \cdot)$ is the cross-entropy loss.
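A minimal sketch of the prediction layer of Eq. (12) and the cross-entropy loss follows; `node_ce_loss` and the weight `W_c` are illustrative names under our assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(Z):
    e = np.exp(Z - Z.max(axis=1, keepdims=True))  # numerically stable
    return e / e.sum(axis=1, keepdims=True)

def node_ce_loss(A_hat, H, W_c, labels):
    """One more aggregation, a row-wise softmax (Eq. (12)),
    then the mean cross-entropy over labeled nodes."""
    Y_hat = softmax(A_hat @ H @ W_c)
    n = labels.shape[0]
    return -np.log(Y_hat[np.arange(n), labels]).mean()
```
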
To obtain more accurate node embeddings $\boldsymbol{a}_i$ and $\boldsymbol{b}_i$, we apply two auxiliary classifiers on $\boldsymbol{a}_i$ and $\boldsymbol{b}_i$; subsequently, the resulting representation for the neighborhood interaction will be more precise as well. Then, as formulated in Eq. (12), we apply one additional graph convolutional layer on each of $\boldsymbol{h}_i$, $\boldsymbol{a}_i$, and $\boldsymbol{b}_i$ to obtain the predictions and the corresponding losses $\mathcal{L}$, $\mathcal{L}_a$, and $\mathcal{L}_b$. Eventually, the overall objective function is the weighted sum of the three losses:
(13) $\mathcal{L}_{\mathrm{total}} = \lambda_1 \mathcal{L} + \lambda_2 \mathcal{L}_a + \lambda_3 \mathcal{L}_b,$

where $\lambda_1$, $\lambda_2$, and $\lambda_3$ are hyperparameters controlling the weights of the three loss functions. For training, we minimize the total loss $\mathcal{L}_{\mathrm{total}}$, while for inference, we only use the prediction derived from $\boldsymbol{h}_i$, since $\mathcal{L}_a$ and $\mathcal{L}_b$ only serve to ensure that $\boldsymbol{p}_i$ is accurate enough.

3.3 Model Architecture and Complexity Analysis
We suppose there are $L$ layers in the underlying graph convolutional model, where the last layer is employed for node classification. For GraphAIR, we employ two separate and symmetric branches, each consisting of $L - 1$ graph convolutional layers, to obtain $\boldsymbol{a}_i$ and $\boldsymbol{b}_i$. Then, considering that $\boldsymbol{a}_i$ and $\boldsymbol{b}_i$ have already aggregated enough information from neighborhoods, we conduct the neighborhood interaction only once, by multiplying $\boldsymbol{a}_i$ and $\boldsymbol{b}_i$, for the sake of efficiency. Additionally, we employ three graph convolutional layers followed by softmax activation functions on $\boldsymbol{h}_i$, $\boldsymbol{a}_i$, and $\boldsymbol{b}_i$. In summary, there are $2(L - 1) + 3$ layers in GraphAIR.
Each layer in GraphAIR has the same space and time complexity as the underlying model, and the additional computation cost of GraphAIR is mainly introduced by the multiplication process for the neighborhood interaction. For the neighborhood interaction in Eq. (11), the cost is $O(nd)$, where $d$ is the embedding dimension. For each layer of an existing graph convolutional model such as GCN or GAT, it takes $O(|\mathcal{E}|d + nd^2)$ time to compute Eqs. (1) and (2). Therefore, the additional computation cost of the neighborhood interaction is insignificant. That is to say, our proposed approach is asymptotically as efficient as the underlying graph convolutional model.
4 Evaluation
We extensively evaluate our proposed GraphAIR model on node classification and link prediction tasks using five public datasets. Besides, we conduct ablation studies on the neighborhood interaction module. For interested readers, we include a comparison of training time and all details of the experimental configurations in the supplementary material.
4.1 Datasets
We use five widely-used datasets to evaluate model performance in both transductive and inductive learning scenarios. Specifically, three citation networks (Cora, Citeseer, Pubmed) are used for transductive node classification and link prediction, one knowledge graph (NELL) is used for transductive node classification, and one multi-graph molecular network (PPI) is used for inductive node classification. We exactly follow the setup in [28, 15, 14, 24]. The statistics of the datasets used throughout the experiments are summarized in Table 1.

Citation networks. We build undirected citation networks from three datasets, where documents and citations are treated as nodes and edges, respectively. We treat the bag-of-words of each document as its feature vector. Our goal is to predict the class of each document. Only twenty labels per class are used for training.
Knowledge graph. The dataset collected from the knowledge base of Never Ending Language Learning (NELL) contains entities, relations, and text descriptions. For every triplet $(e_1, r, e_2)$, where $e_1$ and $e_2$ are entities and $r$ is the relation between them, $r$ is assigned two separate relation nodes $r_1$ and $r_2$, and we add the two edges $(e_1, r_1)$ and $(e_2, r_2)$ to the graph. For the knowledge graph, we conduct entity classification. Similarly, we use bag-of-words representations as feature vectors. Only one label per class is used for training.
Molecular network. We use the PPI (protein-protein interaction) network that consists of twenty-four graphs corresponding to different human tissues. Each node contains fifty features composed of positional gene sets, motif gene sets, and immunological signatures. We select twenty graphs as the training set, two for validation, and two for testing.
Dataset | Cora | Citeseer | Pubmed | NELL | PPI
Task | Transductive | Transductive | Transductive | Transductive | Inductive
Type | Citation network | Citation network | Citation network | Knowledge graph | Molecular network
# Vertices | 2,708 | 3,327 | 19,717 | 65,755 | 56,944
# Edges | 5,429 | 4,732 | 44,338 | 266,144 | 818,716
# Classes | 7 | 6 | 3 | 210 | 121
# Features | 1,433 | 3,703 | 500 | 5,414 | 50
# Training nodes | 140 | 120 | 60 | 210 | 44,906
# Test nodes | 1,000 | 1,000 | 1,000 | 1,000 | 5,524
# Validation nodes | 500 | 500 | 500 | 500 | 6,514
4.2 Experiments on Node Classification
Baseline Methods
We comprehensively compare our method with various traditional random-walk-based algorithms and state-of-the-art GCN-based methods. We closely follow the experimental setting of previous work; the performance of those baselines is reported as in their original papers.²

² In experiments, we found that the results reported in Hamilton et al. (2017) after ten epochs did not converge to the best values. For a fair comparison with other models, we reuse its official implementation and report the results of the baselines after 200 epochs.
Transductive node classification. In the transductive setting, the baselines include the skip-gram-based network embedding method DeepWalk [21], the semi-supervised embedding framework Planetoid [28], graph convolution with one-hop neighbors (GCN) [15], and graph attention networks (GAT) [24]. In addition, we further compare the performance of the proposed model with the recently proposed simplified graph convolutional networks (SGC) [25], which remove redundant nonlinear activations. Also, we adapt graph isomorphism networks (GIN) [26], which utilize nonlinear MLPs as the aggregation function, to the node classification task. Note that since GIN was originally proposed for graph classification, we apply two GIN convolutional layers and remove the graph-level readout function for the transductive node classification task.
Inductive node classification. For inductive node classification, we mainly compare GraphAIR with inductive graph convolutional networks (GraphSAGE) [10] and graph attention networks (GAT) [24]. Note that GraphSAGE provides several variants of neighborhood aggregators: SAGE-GCN concatenates the features of the neighborhoods and the central node, SAGE-mean takes the average over neighborhood feature vectors, SAGE-LSTM combines neighborhood features using an LSTM model, and SAGE-pool uses an element-wise max-pooling operator to aggregate the neighborhood information nonlinearly.
Experimental Configurations
We employ our GraphAIR framework on top of three representative models, namely GCN, GraphSAGE, and GAT; the resulting models are denoted by AIR-GCN, AIR-SAGE, and AIR-GAT, respectively. Particularly, while GraphSAGE proposes several variants for neighborhood aggregation, only SAGE-mean among them satisfies the coefficient normalization in Eq. (11); therefore, we select SAGE-mean as the base model for GraphAIR. For a fair comparison, we closely follow the same hyperparameter settings as the underlying graph convolutional model, such as learning rate, dropout rate, weight decay factor, and hidden dimensions. Since GIN was originally proposed for graph-level classification, its hidden dimensions are set to the same as those of GCN. In the experiments, we only tune the weights of the three loss functions by grid search. For the transductive setting, we use the features of all nodes, but only the labels of the training set are used for training. For the inductive setting, we train our model without the validation and testing data. In addition, we report the average accuracy over 20 runs.
Results and Analysis


Transductive. We summarize the results of transductive node classification in Table 2(a). Note that even when we apply the sparse implementation of GAT, it requires more than 64 GB of memory on the NELL dataset; thus, the performance of GAT and AIR-GAT on NELL is not reported. From the table, it is seen that GraphAIR achieves state-of-the-art performance on all datasets, which demonstrates the effectiveness of the proposed GraphAIR framework. SGC acquires results comparable to those of GCN, which corresponds to our conclusion in Proposition 1 that existing GCNs are not able to learn the nonlinearity of graph data sufficiently. Our proposed AIR-GCN outperforms its base model GCN by margins of 3.2%, 2.6%, 1.0%, and 2.5%. The same trend holds for AIR-GAT and its base model GAT as well. To sum up, the improvements demonstrate the effectiveness of modeling the nonlinear distributions of nodes.
In addition, another important observation is that both AIR-GAT and AIR-GCN outperform complex nonlinear opponents such as GIN. Although MLPs are theoretically able to approximate any complicated nonlinear function asymptotically, they tend to converge to undesired local minima in practice [22]. The experimental results confirm the rationality of explicitly introducing the neighborhood interaction.
Inductive. The results of inductive learning are shown in Table 2(b). AIR-SAGE-mean outperforms its base model SAGE-mean by 1.8%. Besides, we can clearly observe that AIR-GAT achieves the best performance. It is worth noting that the previous state-of-the-art method had already reached high performance, and the proposed AIR-GAT still obtains an improvement of 1.3% over the vanilla GAT. These results also suggest that the proposed GraphAIR framework generalizes to the multi-graph setting.
4.3 Experiments on Link Prediction
To further verify that our proposed framework is general for other graph representation learning tasks, we additionally conduct experiments on link prediction. We choose the citation networks as benchmark datasets and compare against various state-of-the-art methods, including graph autoencoders (GAE) [14] and variational graph autoencoders (VGAE) [14], as well as other baseline algorithms, including spectral clustering (SC) [23] and DeepWalk [21]. We employ our GraphAIR framework on the basis of GAE, which constructs a graph autoencoder with GCNs; the resulting model is denoted by AIR-GAE.

We report the performance in terms of area under the ROC curve (AUC) over 20 runs. The mean performance and standard error are presented in Table 3. It is shown that the proposed AIR-GAE outperforms its vanilla opponents GAE and VGAE, which once again verifies the necessity of incorporating the neighborhood interaction into the neighborhood aggregation. Please note that previous state-of-the-art methods had already obtained high performance on the Pubmed dataset, and our method AIR-GAE pushes the boundary with an absolute improvement of 2.8%, achieving 99.2% in terms of AUC. Also, it can be observed that the proposed method obtains much more pronounced improvements here than on node classification. We suspect that this is primarily because models for the link prediction task usually employ pairwise decoders for calculating the probability of a link between two nodes. For example, GAE and VGAE assume that the probability that there exists an edge between two nodes is proportional to the dot product of the embeddings of these two nodes. Therefore, our approach, which explicitly models the neighborhood interaction through the multiplication of the embeddings of two nodes, is inherently related to the link prediction task and obtains larger improvements.

Method | Cora | Citeseer | Pubmed
SC | 84.6% ± 0.01% | 80.5% ± 0.01% | 84.2% ± 0.02%
DeepWalk | 83.1% ± 0.01% | 80.5% ± 0.02% | 84.4% ± 0.00%
GAE | 91.0% ± 0.02% | 89.5% ± 0.04% | 96.4% ± 0.00%
VGAE | 91.4% ± 0.01% | 90.8% ± 0.02% | 94.4% ± 0.02%
AIR-GAE | 95.4% ± 0.01% | 95.0% ± 0.01% | 99.2% ± 0.02%
4.4 Ablation Studies on the Neighborhood Interaction Module
As analyzed in Section 3.3, the number of parameters in GraphAIR is almost twice that of the underlying graph convolutional model. In this section, we conduct ablation studies to answer the following questions:

Q1: How much improvement has the proposed neighborhood interaction module brought?

Q2: Does the disentangled residual learning strategy bring sufficient improvements?
To answer Q1 and verify that the effectiveness of GraphAIR comes from the proposed neighborhood interaction module rather than the larger number of parameters, we remove the neighborhood interaction module of AIR-GCN. The resulting model has exactly the same number of parameters as AIR-GCN; as this is almost double the number of parameters of the vanilla GCN, we denote the resulting model as DP-GCN (Double-Parameter GCN).
To answer Q2, we employ only one branch of graph convolutional layers to produce the output representations. To obtain the neighborhood interaction representation $\boldsymbol{p}_i$, we directly make use of the self-interaction strategy described in Eq. (9) instead of Eq. (11). The resulting model is termed self-IR-GCN.
For a fair comparison, all other experimental configurations are kept the same as for AIR-GCN. The results of node classification are presented in Table 4. It is seen from the table that the proposed AIR-GCN achieves the best performance and outperforms both DP-GCN and self-IR-GCN. For Q1, we observe that DP-GCN obtains only slightly better accuracy on Cora and Citeseer and almost the same performance as the vanilla GCN on Pubmed. This verifies that the neighborhood interaction module, rather than the larger number of parameters, contributes most of the performance improvement of the proposed AIR-GCN model. For Q2, the performance of self-IR-GCN is only slightly improved on the three datasets, which demonstrates the rationality of modeling the neighborhood interaction; however, disentangling the neighborhood interaction from the neighborhood aggregation brings further improvements.
Method | Cora | Citeseer | Pubmed
GCN | 81.5% | 70.3% | 79.0%
DP-GCN | 82.3% ± 0.1% | 71.0% ± 0.1% | 79.0% ± 0.2%
self-IR-GCN | 82.6% ± 0.0% | 70.8% ± 0.2% | 79.2% ± 0.1%
AIR-GCN | 84.7% ± 0.1% | 72.9% ± 0.1% | 80.0% ± 0.1%
5 Related Work
There have been many attempts in the recent literature to employ neural networks for graph representation learning. Among them, graph convolutional networks (GCNs) have received a lot of research interest. GCN-based models generally follow the neighborhood aggregation scheme: the model passes the input signals from neighborhoods through filters to aggregate information. Many approaches design different strategies to aggregate information from nodes' neighborhoods. According to these strategies, the models can be roughly grouped into two categories, i.e. spectral-based approaches and spatial-based approaches.
On the one hand, spectral methods depend on the Laplacian eigenbasis to define parameterized filters. The first work [3] introduces convolutional operations in the Fourier domain by computing the eigendecomposition of the graph Laplacian, which results in a potentially heavy computational burden. Following this work, Defferrard et al. (2016) propose to approximate the filters using a Chebyshev expansion of the graph Laplacian. Then, graph convolutional networks (GCNs) [15] have been widely applied for graph representation learning. The core of GCNs is the neighborhood aggregation scheme, which generates node embeddings by combining information from neighborhoods. Since GCN only captures local information, DGCN [30] proposes to construct an information matrix to encode global consistency.
On the other hand, the spatial approaches directly operate on spatially close neighbors. To enable parameter sharing of filters across neighbors of different sizes, Duvenaud et al. (2015) first propose to learn weight matrices for different node degrees. MoNet [19] proposes a spatial-domain model to provide a unified convolutional network on graphs. To compute node representations in an inductive manner, GraphSAGE [10] samples fixed-size neighborhoods of nodes and performs aggregation over them. Similarly, Gao et al. (2018) select a fixed number of neighbors and enable the use of conventional convolutional operations on Euclidean spaces. Recently, GAT [24] introduces attention mechanisms to graph neural networks, computing hidden representations by attending over neighbors with a self-attention strategy.
Recently, some methods have been proposed that focus on the linearity and nonlinearity of graphs, respectively. On the one hand, simplified graph convolutional networks (SGC) [25] try to reduce the complexity and eliminate the redundant computation of GCN by successively removing the nonlinear activation functions. SGC assumes that the nonlinearity between GCN layers is not critical to model performance and that the majority of the benefit arises from the neighborhood aggregation scheme. While being more computationally efficient, SGC achieves empirical performance comparable to the vanilla GCN.
There are other methods arguing that modeling nonlinear distributions of node features can bring improvements. For example, GraphSAGE-LSTM [10] employs the long short-term memory (LSTM) module to learn the complex relationships between nodes; empirically, GraphSAGE-LSTM outperforms other aggregation functions such as GraphSAGE-mean and GraphSAGE-GCN. Graph isomorphism networks (GIN) [26] apply multi-layer perceptrons (MLPs) in each graph convolutional layer, which are able to model the complex nonlinearity of graphs. Although it is theoretically well known that MLPs are universal approximators [13], there is no formal theorem giving instructions on how to asymptotically approximate the desired function (Patterson 1998, p. 182; Fausett 1994, p. 328). Different from GraphSAGE-LSTM and GIN, to the best of our knowledge, our work is the first to point out that most existing GCNs may not capture the nonlinearity of graph data well, and we demonstrate the effectiveness of explicitly modeling the nonlinearity of graphs.

6 Conclusion
In this paper, we first proved that existing mainstream GCN-based models have difficulty in capturing the complicated nonlinearity of graph data. Then, in order to better capture the complicated and nonlinear distributions of nodes, we proposed a novel GraphAIR framework that explicitly models neighborhood interaction in addition to the neighborhood aggregation scheme. By employing a residual learning strategy, we disentangle learning the neighborhood interaction from the neighborhood aggregation, which makes optimization easier. The proposed GraphAIR is compatible with most existing graph convolutional models and provides a plug-and-play module for neighborhood interaction. Finally, GraphAIR built on well-known models including GCN, GraphSAGE, and GAT has been thoroughly investigated through empirical evaluation. Extensive experiments on benchmark tasks including node classification and link prediction demonstrate the effectiveness of our model.
Supplementary Material
Appendix A Detailed Proof of Proposition 1
Proposition 1.
When applying the sigmoid function on the result of the linear combination as formulated in Eq. (2), the equivalent coefficient of the high-order interacting terms of the neighborhood embeddings is at most $\frac{1}{48}$.
Proof.
The sigmoid function $\sigma(x)$ can be approximated by its Taylor polynomial at $x_0 = 0$:
(14) $\sigma(x) \approx \sum_{k=0}^{n} \frac{\sigma^{(k)}(0)}{k!} x^k,$
where $n$ is the degree of the polynomial. The approximation error can be bounded using the Lagrange form of the remainder:
(15) $R_n(x) = \frac{\sigma^{(n+1)}(\xi)}{(n+1)!} x^{n+1}, \quad \text{for some } \xi \text{ between } 0 \text{ and } x.$
Since the coefficient of the quadratic term is zero (i.e., $\sigma''(0) = 0$), we set $n = 2$ and analyze the contribution of the high-order interacting terms. Then, replacing $\sigma$ with its Taylor expansion, Eq. (2) can be written as follows:
(16) $\sigma(z) = \frac{1}{2} + \frac{1}{4} z + R_2(z), \quad |R_2(z)| \le \frac{M}{3!} |z|^3,$
where $z$ denotes the linear combination in Eq. (2) and $M = \max_{\xi} |\sigma'''(\xi)|$ is the bound of the remainder. To analyze its maximum value, we first compute the third derivative of the sigmoid function:
(17) $\sigma'''(x) = \sigma(x)\bigl(1 - \sigma(x)\bigr)\bigl(1 - 6\sigma(x) + 6\sigma^2(x)\bigr).$
Then, setting $\sigma^{(4)}(x) = 0$, we can calculate its roots:
(18) $x = 0, \quad x = \pm \ln\bigl(5 + 2\sqrt{6}\bigr).$
Therefore, the corresponding extreme values of $\sigma'''$ are $-\frac{1}{8}$, $\frac{1}{24}$, and $\frac{1}{24}$, so the maximum absolute value of $\sigma'''$ is $M = \frac{1}{8}$. Therefore,
(19) $|R_2(z)| \le \frac{1}{8 \cdot 3!} |z|^3 = \frac{1}{48} |z|^3,$
which shows that the equivalent coefficient of the high-order interacting terms is at most $\frac{1}{48}$ and concludes the proof. ∎
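The two constants at the heart of the proof — the largest absolute value of the third derivative of the sigmoid, 1/8, and the resulting remainder-coefficient bound 1/(8 · 3!) = 1/48 — can be checked numerically (a quick sketch; variable names are our own):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_d3(x):
    # closed form: sigma''' = s(1-s)(1 - 6s + 6s^2), with s = sigma(x)
    s = sigmoid(x)
    return s * (1.0 - s) * (1.0 - 6.0 * s + 6.0 * s ** 2)

xs = np.linspace(-10.0, 10.0, 200001)   # dense grid including x = 0
max_abs = np.abs(sigmoid_d3(xs)).max()  # -> 1/8, attained at x = 0
coeff_bound = max_abs / 6.0             # divide by 3! -> 1/48
```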
Appendix B Experiments on Training Time
The training time per epoch on three citation datasets for the node classification task is summarized in Table 5. It is seen that GraphAIR is comparable to the underlying models in training time, although it has almost twice as many parameters.
Method  Cora  Citeseer  Pubmed 
GCN  3.97  3.62  3.97 
GIN-0  6.10  7.70  8.60
GIN-ε  6.12  7.71  8.63
AIR-GCN  8.09  8.37  9.00
GAT  10.60  10.18  14.90
AIR-GAT  25.70  25.75  34.10
Appendix C Details on Experimental Settings
c.1 Dataset Configurations
The detailed dataset configurations for the node classification task are summarized in Table 6. For interested readers, we also provide links to download the datasets in Table 7. For the link prediction task, 85% of the edges are selected as the training set, while 10% and 5% of the edges are held out as the test set and the validation set, respectively. The nodes, edges, or graphs constituting the training, test, and validation sets are randomly selected.
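The 85/10/5 edge split can be sketched as follows (a hypothetical helper for illustration; the splitting code in the official implementations may differ):

```python
import numpy as np

def split_edges(edges, seed=0):
    """Randomly split an edge list into 85% train / 10% test / 5% validation."""
    rng = np.random.default_rng(seed)
    edges = np.asarray(edges)
    perm = rng.permutation(len(edges))
    n_train = int(0.85 * len(edges))
    n_test = int(0.10 * len(edges))
    train = edges[perm[:n_train]]
    test = edges[perm[n_train:n_train + n_test]]
    val = edges[perm[n_train + n_test:]]
    return train, test, val

edges = [(i, i + 1) for i in range(100)]  # toy edge list
train, test, val = split_edges(edges)
```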
Statistics  Cora  Citeseer  Pubmed  NELL  PPI 
# of classes  7  6  3  210  121 (multilabel) 
# of training nodes  140  120  60  210  44,906 (20 graphs) 
# of validation nodes  500  500  500  500  6,514 (2 graphs) 
# of test nodes  1,000  1,000  1,000  1,000  5,524 (2 graphs) 
c.2 Metrics and Measurements
For the node classification task, we report performance in terms of accuracy. For the link prediction task, we report performance in terms of the standard area under the ROC curve (AUC) metric. The number of epochs is set to 2,000 and we employ an early stopping strategy with a window size of 1,000. Besides, we report averaged performance with standard deviations over 20 runs.
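The early-stopping rule used here can be sketched as follows: training stops once the validation score has not improved for a full window of epochs (a minimal sketch under that interpretation; names are our own):

```python
def should_stop(val_scores, window=1000):
    """Return True when the best validation score lies at least
    `window` epochs in the past (higher scores are better)."""
    if len(val_scores) <= window:
        return False
    best_epoch = max(range(len(val_scores)), key=lambda i: val_scores[i])
    return (len(val_scores) - 1 - best_epoch) >= window
```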
c.3 Computing Infrastructures
We implement GraphAIR based on the official implementation of GCN, GAT, and GraphSAGE using TensorFlow 1.11. The experiments are conducted on a computer server with eight NVIDIA Titan Xp GPUs.
c.4 Hyperparameter Specifications
Method  Dataset  λ₁  λ₂  λ₃  Learning rate  Dropout rate  Weight decay factor  Hidden dimension  # of layers
AIR-GCN  Cora  1.1  0.5  0.5  0.01  0.5  0.0005  16  2
AIR-GCN  Citeseer  1.1  0.6  0.6  0.01  0.5  0.0005  16  2
AIR-GCN  Pubmed  1.1  0.9  0.9  0.01  0.5  0.0005  16  2
AIR-GCN  NELL  1.1  0.5  0.5  0.01  0.1  0.00001  64  2
AIR-SAGE  PPI  1.1  0.5  0.5  0.01  0  0  128  2
Method  Dataset  λ₁  λ₂  λ₃  Hidden dimension 1  Hidden dimension 2  Learning rate  # of layers
AIR-GAE  Cora  1.1  0.5  0.5  32  32  0.01  2
AIR-GAE  Citeseer  1  0.4  0.4  64  64  0.01  2
AIR-GAE  Pubmed  0.9  0.3  0.1  64  128  0.01  2
Method  Dataset  λ₁  λ₂  λ₃  Dropout rate  Weight decay factor  Learning rate  # of layers
AIR-GAT  Cora  1.3  0.5  0.8  0.6  0.0005  0.005  2
AIR-GAT  Citeseer  1.2  0.5  0.8  0.6  0.0005  0.005  2
AIR-GAT  Pubmed  0.8  0.4  0.3  0.6  0.001  0.01  2
AIR-GAT  PPI  1.0  0.6  0.6  0.0  0.0  0.005  3
For a fair comparison, we closely follow the same hyperparameter settings as the underlying graph convolutional models, such as learning rate, dropout rate, weight decay factor, and hidden dimensions. In the experiments, we only tune the three weights λ₁, λ₂, and λ₃ by grid search on the validation set. The values of these hyperparameters are summarized in Table 8.
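The grid search over the three weights can be sketched as follows (the candidate grid and the scoring function below are placeholders for illustration, not the values used in the paper):

```python
import itertools

def grid_search(evaluate, grid):
    """Return the weight triple with the best validation score.
    `evaluate` maps (w1, w2, w3) -> a validation score (higher is better)."""
    best_w, best_score = None, float("-inf")
    for w in itertools.product(grid, repeat=3):
        score = evaluate(w)
        if score > best_score:
            best_w, best_score = w, score
    return best_w, best_score

# toy example: pretend the best triple is (1.0, 0.5, 0.5)
target = (1.0, 0.5, 0.5)
score_fn = lambda w: -sum((a - b) ** 2 for a, b in zip(w, target))
best_w, best_score = grid_search(score_fn, [0.0, 0.5, 1.0])
```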
References
 [1] (2013) Representation learning: a review and new perspectives. TPAMI 35 (8), pp. 1798–1828. Cited by: §3.1.
 [2] (1992) A training algorithm for optimal margin classifiers. In COLT, pp. 144–152. Cited by: §1.
 [3] (2014) Spectral networks and locally connected networks on graphs. In ICLR, Cited by: §5.
 [4] (2019) Supervised community detection with line graph neural networks. In ICLR, Cited by: §1.
 [5] (2016) Convolutional neural networks on graphs with fast localized spectral filtering. In NIPS, pp. 3844–3852. Cited by: §5.
 [6] (2015) Convolutional networks on graphs for learning molecular fingerprints. In NIPS, pp. 2224–2232. Cited by: §5.
 [7] (1994) Fundamentals of neural networks: architectures, algorithms, and applications. Cited by: §5.
 [8] (2018) Large-scale learnable graph convolutional networks. In KDD, pp. 1416–1424. Cited by: §5.
 [9] (2010) Understanding the difficulty of training deep feedforward neural networks. In AISTATS, pp. 249–256. Cited by: footnote 1.
 [10] (2017) Inductive representation learning on large graphs. In NIPS, pp. 1024–1034. Cited by: §1, §4.2, §5, §5, footnote 2.
 [11] (1988) A combined corner and edge detector. In Alvey Vision Conference, pp. 147–151. Cited by: §1.
 [12] (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778. Cited by: §2.3.
 [13] (1991) Approximation capabilities of multilayer feedforward networks. Neural Networks 4 (2), pp. 251–257. Cited by: §5.
 [14] (2016) Variational graph autoencoders. In Bayesian Deep Learning Workshop (NIPS 2016), Cited by: §4.1, §4.3.
 [15] (2017) Semi-supervised classification with graph convolutional networks. In ICLR, Cited by: §1, §1, §2.2, §4.1, §4.2, §5.
 [16] (1989) Backpropagation applied to handwritten zip code recognition. Neural Computation 1 (4), pp. 541–551. Cited by: §1.
 [17] (2018) Visualizing the loss landscape of neural nets. In NeurIPS, pp. 6389–6399. Cited by: §2.3.
 [18] (2018) Deeper insights into graph convolutional networks for semi-supervised learning. In AAAI, pp. 3538–3545. Cited by: Remark.
 [19] (2017) Geometric deep learning on graphs and manifolds using mixture model cnns. In CVPR, pp. 5425–5434. Cited by: §5.
 [20] (1998) Artificial neural networks: theory and applications. Cited by: §5.
 [21] (2014) DeepWalk: online learning of social representations. In KDD, pp. 701–710. Cited by: §4.2, §4.3.
 [22] (1996) Neural Networks: A Systematic Introduction. Cited by: §4.2.
 [23] (2011) Leveraging social media networks for classification. DMKD 23 (3), pp. 447–478. Cited by: §4.3.
 [24] (2018) Graph attention networks. In ICLR, Cited by: §1, §1, §2.2, §4.1, §4.2, §4.2, §5.
 [25] (2019) Simplifying graph convolutional networks. In ICML, pp. 6861–6871. Cited by: §4.2, §5.
 [26] (2019) How powerful are graph neural networks?. In ICLR, Cited by: §4.2, §5.
 [27] (2018) Representation learning on graphs with jumping knowledge networks. In ICML, pp. 5453–5462. Cited by: §1.
 [28] (2016) Revisiting semi-supervised learning with graph embeddings. In ICML, pp. 40–48. Cited by: §4.1, §4.2.
 [29] (2018) Link prediction based on graph neural networks. In NIPS, pp. 5167–5177. Cited by: §1.
 [30] (2018) Dual graph convolutional networks for graph-based semi-supervised classification. In WWW, pp. 499–508. Cited by: §1, §5.