A graph neural network (GNN) is a kind of neural network that performs neural network operations over graph structure to learn node representations. Owing to its ability to learn more comprehensive node representations than models that consider only node features Yang et al. (2016) or only graph structure Perozzi et al. (2014), GNN has been a promising solution for a wide range of applications in social science Chen et al. (2018), applied chemistry Liao et al. (2019), and recommendation Wang et al. (2019), among others. To date, most graph convolution operations in GNNs are implemented as a linear aggregation (i.e., weighted sum) over the features of the neighbors of the target node Kipf and Welling (2017). Although it improves the representation of the target node, such linear aggregation assumes that the neighbor nodes are independent of each other, ignoring the possible interactions between them.
Under some circumstances, the interactions between neighbor nodes could be a strong signal that indicates the characteristics of the target node. Figure 1 (left) illustrates a toy example of a target node and its neighbors in a transaction graph, where edges denote money transfer relations and nodes are described by a set of features such as age and income. The interaction between nodes 1 and 2, which indicates that both have high incomes, could be a strong signal for estimating the credit rating of the target node (the intuition is that a customer who has close business relations with rich friends has a higher chance of repaying a loan). Explicitly modeling such interactions between neighbors highlights the common properties within the local structure, which could be rather helpful for the target node's representation. In Figure 1 (right), we show that the summation-based linear aggregator, a common choice in existing GNNs, fails to highlight the income feature. In contrast, by using a multiplication-based aggregator that captures node interactions, the signal latent in shared high incomes is highlighted, and as an auxiliary effect, some less useful features are zeroed out.
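The contrast between the two aggregators can be sketched on made-up data. The feature names and values below are hypothetical and are not the ones shown in Figure 1; the point is only that a sum keeps every active feature, while an element-wise product keeps the features both neighbors share.

```python
import numpy as np

# Hypothetical binary features for two neighbors: [high_income, young, urban].
# These values are illustrative only and are not taken from Figure 1.
neighbor_1 = np.array([1.0, 1.0, 0.0])
neighbor_2 = np.array([1.0, 0.0, 1.0])

# Summation-based (linear) aggregation: every active feature contributes.
linear = neighbor_1 + neighbor_2

# Multiplication-based aggregation: only features shared by both neighbors survive.
interaction = neighbor_1 * neighbor_2

print(linear)       # [2. 1. 1.]
print(interaction)  # [1. 0. 0.]
```

In this toy case the product zeroes out the two features that only one neighbor has, which mirrors the "auxiliary effect" described above.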
Nevertheless, it is non-trivial to encode such local node interactions in a GNN. The difficulty mainly comes from two indispensable requirements of a feasible graph convolution operation: 1) permutation invariance Xu et al. (2019b), i.e., the output should remain the same when the order of neighbor nodes is changed, so as to ensure the stability of a GNN; and 2) linear complexity Kipf and Welling (2017), i.e., the computational complexity should increase linearly with respect to the number of neighbors, so as to make a GNN scalable on large graphs. To this end, we take inspiration from neural factorization machines He and Chua (2017) to devise a new bilinear aggregator, which explicitly expresses the interactions between every two nodes and aggregates all pairwise interactions to enhance the target node's representation.
On this basis, we develop a new graph convolution operator which is equipped with both the traditional linear aggregator and the newly proposed bilinear aggregator, and is proved to be permutation invariant. We name the new model Bilinear Graph Neural Network (BGNN), which is expected to learn more comprehensive representations by considering local node interactions. We devise two BGNN models, named BGCN and BGAT, which are equipped with the GCN and GAT linear aggregators, respectively. Taking semi-supervised node classification as an example task, we evaluate BGCN and BGAT on three benchmark datasets to validate their effectiveness. Specifically, BGCN and BGAT outperform GCN and GAT by 1.6% and 1.5%, respectively. More fine-grained analyses show that the improvements on sparsely connected nodes are more significant, demonstrating the strengths of the bilinear aggregator in modeling node interactions.
The main contributions of this paper are summarized as:
- We propose BGNN, a simple yet effective GNN framework, which explicitly encodes the local node interactions to augment the conventional linear aggregator.
- We prove that the proposed BGNN model has the properties of permutation invariance and linear computational complexity, both of which are important for GNN models.
- We conduct extensive experiments on three public benchmarks of semi-supervised node classification, validating the effectiveness of the proposed BGNN models.
2 Related work
GNN generalizes traditional convolutional neural networks from Euclidean space to the graph domain. According to the format of the convolution operations, existing GNN models can be divided into two categories: spatial GNN and spectral GNN Zhang et al. (2018). We review the two kinds of models separately, and refer readers to Bronstein et al. (2017) for the mathematical connection between them.
Spectral GNN is defined as performing convolution operations in the Fourier domain with spectral node representations Bruna et al. (2014); Defferrard et al. (2016); Kipf and Welling (2017); Liao et al. (2019); Xu et al. (2019a). Bruna et al. (2014) define the convolution over the eigenvectors of the graph Laplacian, which are viewed as the Fourier basis. Considering the high computational cost of the eigen-decomposition, research on spectral GNN has focused on approximating the decomposition with different mathematical techniques Defferrard et al. (2016); Kipf and Welling (2017); Liao et al. (2019); Xu et al. (2019a). For instance, Defferrard et al. (2016) introduce Chebyshev polynomials of order $K$ to approximate the eigen-decomposition. Kipf and Welling (2017) simplify this model by limiting $K = 1$ and approximating the largest eigenvalue $\lambda_{max}$ of the Laplacian matrix by 2. In addition, Liao et al. (2019) employ the Lanczos algorithm to perform a low-rank approximation of the graph Laplacian. Recently, the wavelet transform was introduced to spectral GNN to avoid the eigen-decomposition Xu et al. (2019a). However, spectral GNN models are hard to apply to large graphs such as social networks and recommendation networks. This is because the convolution operations must be performed over the whole graph, which incurs unaffordable memory cost and rules out the widely applied batch training.
Spatial GNN instead performs convolution operations directly over the graph structure by aggregating the features from spatially close neighbors to a target node Atwood and Towsley (2016); Hamilton et al. (2017); Kipf and Welling (2017); Veličković et al. (2018); Xu et al. (2018); Xinyi and Chen (2019); Veličković et al. (2019); Xu et al. (2019b). This line of research mainly focuses on developing aggregation methods from different perspectives. For instance, Kipf and Welling (2017) propose a linear aggregator (i.e., weighted sum) that uses the inverse of the node degree as the coefficient. To improve the representation performance, a neural attention mechanism is introduced to learn the coefficients Veličković et al. (2018). In addition to aggregating information from directly connected neighbors, augmented aggregators also account for multi-hop neighbors Atwood and Towsley (2016); Xu et al. (2018). Moreover, non-linear aggregators are also employed in spatial GNNs, such as max pooling Hamilton et al. (2017), capsule Veličković et al. (2019), and Long Short-Term Memory (LSTM) Hamilton et al. (2017). Furthermore, spatial GNN is extended to graphs with both static and temporal neighbor structures Park and Neville (2019) and to representations in hyperbolic space Chami et al. (2019).
However, most existing aggregators (both linear and non-linear ones) overlook the importance of the interactions among neighbors. Because they are built upon the summation operation, the linear aggregators by nature assume that neighbors are independent. Most of the non-linear ones focus on the properties of the neighbors at the set level (i.e., all neighbors), e.g., the "skeleton" of the neighbors Xu et al. (2019b). Taking one neighbor as the input of a time step, the LSTM-based aggregator could capture sequential dependency, which might include interactions between nodes. However, it requires a predefined order on the neighbors, which violates permutation invariance, and it typically shows weak performance Hamilton et al. (2017). Our work differs from those aggregators in that we explicitly consider pairwise node interactions in a neat and systematic way.
3 Bilinear Graph Neural Network
Let $G = (A, X)$ be the graph of interest, where $A \in \{0,1\}^{N \times N}$ is the binary adjacency matrix in which an element $a_{vi} = 1$ means that an edge exists between nodes $v$ and $i$, and $X \in \mathbb{R}^{N \times F}$ is the original feature matrix that describes each node with a vector of size $F$ (a row). We denote the neighbors of node $v$ as $\mathcal{N}(v) = \{i \mid a_{vi} = 1\}$, which stores all nodes that have an edge with $v$, and denote the extended neighbors of node $v$ as $\tilde{\mathcal{N}}(v) = \{v\} \cup \mathcal{N}(v)$, which also contains the node itself. For convenience, we use $d_v$ to denote the degree of node $v$, i.e., $d_v = |\mathcal{N}(v)|$, and accordingly $\tilde{d}_v = d_v + 1$. The model objective is to learn a representation vector $h_v \in \mathbb{R}^{D}$ for each node $v$, such that its characteristics are properly encoded. For example, the label of node $v$ can be directly predicted as a function output $y_v = f(h_v)$, without the need to look into the graph structure and the original node features in $G$.
The spatial GNN Veličković et al. (2018) achieves this goal by recursively aggregating the features from neighbors:

$$h_v^{(k)} = \mathrm{AGG}\big(\{W^{(k)} h_i^{(k-1)}\}_{i \in \tilde{\mathcal{N}}(v)}\big), \tag{1}$$

where $h_v^{(k)}$ denotes the representation of target node $v$ at the $k$-th layer/iteration, $W^{(k)}$ is the weight matrix (model parameter) that performs feature transformation at the $k$-th layer, and the initial feature representation $h_v^{(0)}$ can be obtained from the original feature matrix $X$.
The $\mathrm{AGG}$ function is typically implemented as a weighted sum $\sum_{i \in \tilde{\mathcal{N}}(v)} a_{vi} W^{(k)} h_i^{(k-1)}$, with $a_{vi}$ as the weight of neighbor $i$. In GCN Kipf and Welling (2017), $a_{vi}$ is defined as $1 / \sqrt{\tilde{d}_v \tilde{d}_i}$, which is grounded in graph Laplacian theory. The more recent graph attention network (GAT) Veličković et al. (2018) learns $a_{vi}$ from data, which has the potential to lead to better performance than pre-defined choices. However, a limitation of such a weighted sum is that no interactions between neighbor representations are modeled. Although using a more powerful feature transformation function such as a multi-layer perceptron (MLP) Xu et al. (2019b) can alleviate the problem, the process is rather implicit and ineffective. Empirical evidence comes from Beutel et al. (2018), who show that an MLP is inefficient at capturing multiplicative relations between input features. In this work, we propose to explicitly inject multiplication-based node interactions into the $\mathrm{AGG}$ function.
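As a minimal sketch of the weighted-sum aggregator with GCN-style coefficients (the inverse square root of the product of the two extended degrees), the following NumPy function may help. The function name and the dense-matrix layout (one row of `H` per node) are assumptions made for illustration, and no non-linearity is applied.

```python
import numpy as np

def gcn_aggregate(H, A, W):
    """Weighted sum over extended neighbors with GCN-style coefficients.

    H: (N, D_in) node representations, one row per node.
    A: (N, N) binary adjacency matrix without self-loops.
    W: (D_in, D_out) weight matrix.
    """
    A_tilde = A + np.eye(A.shape[0])           # add self-loops (extended neighbors)
    d_tilde = A_tilde.sum(axis=1)              # extended degree of each node
    coeff = A_tilde / np.sqrt(np.outer(d_tilde, d_tilde))
    return coeff @ H @ W                       # each row is a weighted sum of neighbor rows

# Path graph 0 - 1 - 2 with one-hot features, so the output rows expose the coefficients.
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
out = gcn_aggregate(np.eye(3), A, np.eye(3))
```

Each output row is a fixed linear combination of neighbor rows, so no product between two different neighbors ever appears; that is the limitation the paragraph above points to.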
3.2 Bilinear Aggregator
As demonstrated in Figure 1, the multiplication between two vectors is an effective way to model their interaction, emphasizing common properties and diluting discrepant information. Inspired by factorization machines (FMs) Rendle (2010); He and Chua (2017), which have been intensively used to learn the interactions among categorical variables, we propose a bilinear aggregator that is suitable for modeling the neighbor interactions in a local structure:

$$\mathrm{BA}\big(\{h_i\}_{i \in \tilde{\mathcal{N}}(v)}\big) = \frac{1}{b_v} \sum_{i \in \tilde{\mathcal{N}}(v)} \; \sum_{j \in \tilde{\mathcal{N}}(v),\, i < j} W h_i \odot W h_j, \tag{2}$$

where $\odot$ is the element-wise product; $v$ is the target node for which the representation is obtained; and $i$ and $j$ are node indices from the extended neighbors $\tilde{\mathcal{N}}(v)$, which are constrained to be different to avoid self-interactions that are meaningless and may even introduce extra noise. $b_v = \frac{1}{2} \tilde{d}_v (\tilde{d}_v - 1)$ denotes the number of interactions for the target node $v$, which normalizes the obtained representation to remove the bias of node degree. It is worth noting that we take the target node itself into account and aggregate information from the extended neighbors. Although this looks the same as in GNN, the reasons differ. In GNN, accounting for the target node retains its information during layer-wise aggregation, working like residual learning He et al. (2016). In BGNN, our consideration is that the interactions between the target node and its neighbors may also carry useful signal. For example, for sparse nodes that have only one neighbor, no interaction between neighbors exists, and the interaction between the target and the neighbor node can be particularly helpful.
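A direct, quadratic-time reading of Equation (2) for a single target node can be sketched as follows. The function name, the row-vector layout of `H`, and the choice to return a zero vector for an isolated node (which has no pairs to average) are assumptions for illustration, not part of the paper.

```python
import numpy as np
from itertools import combinations

def bilinear_aggregate_naive(H, A, W, v):
    """Average of element-wise products over all pairs i < j of extended neighbors of v."""
    # Extended neighbors: the target node itself plus every node sharing an edge with it.
    ext = [v] + [int(i) for i in np.flatnonzero(A[v]) if i != v]
    S = H[ext] @ W                              # transformed vectors, one row per neighbor
    n_pairs = len(ext) * (len(ext) - 1) // 2    # number of pairwise interactions
    if n_pairs == 0:
        # Assumption: an isolated node has no interactions, so return zeros.
        return np.zeros(W.shape[1])
    total = sum(S[i] * S[j] for i, j in combinations(range(len(ext)), 2))
    return total / n_pairs

# Path graph 0 - 1 - 2 with small hand-checkable features.
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
H = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
W = np.eye(2)
center = bilinear_aggregate_naive(H, A, W, 1)   # three pairs among nodes {1, 0, 2}
leaf = bilinear_aggregate_naive(H, A, W, 0)     # one pair: the target with its only neighbor
```

For the leaf node the only interaction is between the target and its single neighbor, which is the sparse-node case discussed above.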
Time Complexity Analysis.
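At first sight, the bilinear aggregator considers all pairwise interactions between neighbors (including the target node), and thus may have a quadratic time complexity w.r.t. the neighbor count, which is higher than that of the weighted sum. Nevertheless, through a mathematical reformulation similar to the one used in FM, we can compute the aggregator in linear time, $O(|\tilde{\mathcal{N}}(v)|)$, the same complexity as the weighted sum. To show this, we rewrite Equation (2) in its equivalent form as:

$$\mathrm{BA}\big(\{h_i\}_{i \in \tilde{\mathcal{N}}(v)}\big) = \frac{1}{2 b_v} \Big( \sum_{i \in \tilde{\mathcal{N}}(v)} \sum_{j \in \tilde{\mathcal{N}}(v)} s_i \odot s_j - \sum_{i \in \tilde{\mathcal{N}}(v)} s_i \odot s_i \Big) = \frac{1}{2 b_v} \Big( \big( \sum_{i \in \tilde{\mathcal{N}}(v)} s_i \big)^2 - \sum_{i \in \tilde{\mathcal{N}}(v)} s_i^2 \Big), \tag{3}$$

where $s_i = W h_i \in \mathbb{R}^{D}$ and the squares are taken element-wise. As can be seen, the reformulation reduces the sum over pairwise element-wise products to the difference of two terms, where each term is a sum of neighbor representations (or their squares) and can be computed in $O(|\tilde{\mathcal{N}}(v)|)$ time. Note that multiplying by the weight matrix $W$ is a standard operation in an aggregator, so its time cost is omitted for brevity.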
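The identity behind the linear-time form can be checked numerically. The sketch below compares an explicit double loop over pairs with the "square of the sum minus sum of squares" shortcut on random vectors; the function names are illustrative, and the degree normalization is left out because it is the same constant in both forms.

```python
import numpy as np

def pairwise_sum(S):
    """Quadratic-time reference: sum of S[i] * S[j] over all pairs i < j."""
    n = S.shape[0]
    out = np.zeros(S.shape[1])
    for i in range(n):
        for j in range(i + 1, n):
            out += S[i] * S[j]
    return out

def pairwise_sum_linear(S):
    """Linear-time form: half of (square of the sum minus sum of the squares), element-wise."""
    return 0.5 * (S.sum(axis=0) ** 2 - (S ** 2).sum(axis=0))

rng = np.random.default_rng(0)
S = rng.normal(size=(6, 4))     # rows play the role of the transformed neighbor vectors
assert np.allclose(pairwise_sum(S), pairwise_sum_linear(S))
```

The shortcut touches each neighbor vector a constant number of times, which is where the linear cost in the neighbor count comes from.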
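Proof of Permutation Invariance.
This property is intuitive to understand from the reduced Equation (3): when the order of the input vectors is changed, the sum of the inputs (the first term) and the sum of the squares of the inputs (the second term) do not change. Thus the output is unchanged and the permutation invariance property is satisfied. To provide a rigorous proof, we give the matrix form of the bilinear aggregator, which also facilitates the matrix-wise implementation of BGNN. The matrix form of the bilinear aggregator is:

$$\mathrm{BA}(H, A) = \frac{1}{2} B^{-1} \Big( (\tilde{A} H W) \odot (\tilde{A} H W) - \tilde{A} (H W \odot H W) \Big),$$

where $H \in \mathbb{R}^{N \times D}$ stores the representation vectors $h$ for all nodes, $\tilde{A} = A + I$ is the adjacency matrix of the graph with a self-loop added on each node ($I \in \mathbb{R}^{N \times N}$ is an identity matrix), $B$ is a diagonal matrix with each element $B_{vv} = b_v$, and $\odot$ denotes the element-wise product of two matrices.
Let $P \in \mathbb{R}^{N \times N}$ be any permutation matrix, which satisfies (1) $P^{\top} P = I$, and (2) for any matrix $M$ for which $PM$ exists, $PM \odot PM = P (M \odot M)$. When we apply the permutation $P$ to the nodes, $H$ changes to $PH$, $\tilde{A}$ changes to $P \tilde{A} P^{\top}$, and $B$ changes to $P B P^{\top}$, which leads to:

$$\mathrm{BA}(PH, PAP^{\top}) = \frac{1}{2} P B^{-1} P^{\top} \Big( (P \tilde{A} P^{\top} P H W) \odot (P \tilde{A} P^{\top} P H W) - P \tilde{A} P^{\top} (P H W \odot P H W) \Big) = \frac{1}{2} P B^{-1} \Big( (\tilde{A} H W) \odot (\tilde{A} H W) - \tilde{A} (H W \odot H W) \Big) = P \cdot \mathrm{BA}(H, A),$$

which indicates the permutation invariance property.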
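The matrix form and the permutation argument can be exercised with a short NumPy sketch. The function name and the handling of isolated nodes (zero output rather than a division by zero) are assumptions; the graph and the random inputs are arbitrary test data.

```python
import numpy as np

def bilinear_aggregate(H, A, W):
    """Matrix-form bilinear aggregator over a graph with self-loops added.

    H: (N, D_in) node representations, A: (N, N) binary adjacency, W: (D_in, D_out).
    """
    A_tilde = A + np.eye(A.shape[0])            # adjacency with self-loops
    d_tilde = A_tilde.sum(axis=1)               # extended degrees
    b = 0.5 * d_tilde * (d_tilde - 1.0)         # number of pairs per node
    S = H @ W
    num = (A_tilde @ S) ** 2 - A_tilde @ (S ** 2)
    # Assumption: nodes without any pair (b == 0) have a zero numerator, so divide by 1.
    safe_b = np.where(b > 0, b, 1.0)
    return 0.5 * num / safe_b[:, None]

rng = np.random.default_rng(0)
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
H = rng.normal(size=(4, 3))
W = rng.normal(size=(3, 2))
P = np.eye(4)[rng.permutation(4)]               # a random permutation matrix

out = bilinear_aggregate(H, A, W)
out_perm = bilinear_aggregate(P @ H, P @ A @ P.T, W)
assert np.allclose(out_perm, P @ out)           # relabeling nodes only reorders the rows
```

Relabeling the nodes permutes the output rows in the same way and leaves each node's own representation unchanged, which is the sense in which the text calls the aggregator permutation invariant.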
3.3 BGNN Model
We now describe the proposed BGNN model. As the bilinear aggregator emphasizes node interactions and encodes a different signal from the weighted sum aggregator, we combine them to build a more expressive graph convolutional network. (Note that we do not argue that the bilinear aggregator is better than the weighted sum. Instead, each has its own pros and cons in aggregating the information from neighbors. Thus, we combine them so that they complement each other for better expressiveness.) We adopt a simple linear combination scheme, defining a new graph convolution operator as:

$$H^{(k)} = \mathrm{BGNN}(H^{(k-1)}, A) = (1 - \alpha) \cdot \mathrm{AGG}(H^{(k-1)}, A) + \alpha \cdot \mathrm{BA}(H^{(k-1)}, A),$$

where $H^{(k)}$ stores the node representations at the $k$-th layer (encoding $k$-hop neighbors), and $\alpha$ is a hyper-parameter that trades off the strengths of the traditional GNN aggregator and our proposed bilinear aggregator. Figure 2 illustrates the model framework.
Since both $\mathrm{AGG}$ and $\mathrm{BA}$ are permutation invariant, it is trivial to see that this graph convolution operator is also permutation invariant. When $\alpha$ is set to 0, no node interaction is considered and BGNN degrades to GNN; when $\alpha$ is set to 1, BGNN uses only the bilinear aggregator to process the information from the neighbors. Our empirical studies show that an intermediate value between 0 and 1 usually leads to better performance, verifying the efficacy of modeling node interactions, and that the optimal setting varies across datasets.
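The combination rule itself is a one-line convex mix of two aggregator outputs. In the sketch below the two aggregators are passed in as functions; `mean_agg` and `pair_agg` are simple placeholders chosen to keep the example short, not the GCN/GAT and normalized bilinear aggregators used in the paper.

```python
import numpy as np

def bgnn_layer(H, A, W, alpha, agg, ba):
    """Mix a linear aggregator and a bilinear aggregator with trade-off weight alpha."""
    return (1.0 - alpha) * agg(H, A, W) + alpha * ba(H, A, W)

# Placeholder aggregators, used only to exercise the combination rule.
def mean_agg(H, A, W):
    A_tilde = A + np.eye(A.shape[0])
    return (A_tilde / A_tilde.sum(axis=1, keepdims=True)) @ H @ W

def pair_agg(H, A, W):
    A_tilde = A + np.eye(A.shape[0])
    S = H @ W
    return 0.5 * ((A_tilde @ S) ** 2 - A_tilde @ (S ** 2))   # unnormalized pair sum

A = np.array([[0, 1], [1, 0]], dtype=float)
H = np.array([[1.0], [3.0]])
W = np.array([[1.0]])

only_linear = bgnn_layer(H, A, W, 0.0, mean_agg, pair_agg)    # alpha = 0: linear part only
only_bilinear = bgnn_layer(H, A, W, 1.0, mean_agg, pair_agg)  # alpha = 1: bilinear part only
mixed = bgnn_layer(H, A, W, 0.5, mean_agg, pair_agg)
```

Passing the aggregators as arguments keeps the layer agnostic to whether the linear part is GCN-style or attention-based.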
Traditional GNN models Kipf and Welling (2017); Veličković et al. (2018); Xu et al. (2019b) encode information from multi-hop neighbors in a recursive manner by stacking multiple aggregators. For example, the 2-layer GNN model is formalized as

$$\mathrm{GNN}_2(X, A) = \sigma\big(\mathrm{AGG}(\sigma(\mathrm{AGG}(X, A)), A)\big),$$

where $\sigma$ is a non-linear activation function. Similarly, we can devise a 2-layer BGNN model in the same recursive manner:

$$\mathrm{BGNN}_2(X, A) = \sigma\big(\mathrm{BGNN}(\sigma(\mathrm{BGNN}(X, A)), A)\big).$$

However, such a straightforward multi-layer extension involves unexpected higher-order interactions. In the two-layer case, the second-layer representation will include partial 4th-order interactions among the two-hop neighbors, which are hard to interpret and difficult to justify. When extending BGNN to multiple layers, say $K$ layers, we still hope to capture pairwise interactions, but between the $K$-hop neighbors. To this end, instead of directly stacking layers, we define the 2-layer model as:

$$\mathrm{BGNN}_2(X, A) = (1 - \alpha) \cdot \mathrm{GNN}_2(X, A) + \alpha \big[ (1 - \beta) \cdot \mathrm{BA}(X, A) + \beta \cdot \mathrm{BA}(X, A^{(2)}) \big],$$

where $A^{(2)} = \mathrm{binarize}(AA)$ stores the 2-hop connectivities of the graph, and $\mathrm{binarize}$ is an entry-wise operation that transforms non-zero entries to 1. As such, a non-zero entry $(v, i)$ in $A^{(2)}$ means node $v$ can reach node $i$ within two hops. $\beta$ is a hyper-parameter that trades off the strengths of bilinear interactions within 1-hop neighbors and 2-hop neighbors.
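Following the same principle, we define the $K$-layer BGNN as:

$$\mathrm{BGNN}_K(X, A) = (1 - \alpha) \cdot \mathrm{GNN}_K(X, A) + \alpha \cdot \Big( \sum_{k=1}^{K} \beta_k \cdot \mathrm{BA}(X, A^{(k)}) \Big),$$

where $A^{(k)} = \mathrm{binarize}(A^k)$ denotes the adjacency matrix of $k$-hop connectivities, $\beta_k$ weights the bilinear interactions among $k$-hop neighbors, and $\mathrm{GNN}_K$ denotes a normal $K$-layer GNN that can be defined recursively, such as a $K$-layer GCN or GAT. The time complexity of a $K$-layer BGNN is determined by the number of non-zero entries in $A^{(K)}$. To reduce the actual complexity, one can follow the sampling strategy in GraphSage Hamilton et al. (2017), sampling a portion of the high-hop neighbors rather than using all of them.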
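The multi-hop connectivity matrix can be sketched as the binarized matrix power of the adjacency matrix. The function name is illustrative, and whether self-loops are added to the adjacency matrix before taking the power is left to the caller here, since that detail is an assumption rather than something stated in the text.

```python
import numpy as np

def k_hop_adjacency(A, k):
    """Binarized k-th power of A: entry (v, i) is 1 if a walk of k steps links v and i."""
    walks = np.linalg.matrix_power(A.astype(np.int64), k)   # counts of k-step walks
    return (walks > 0).astype(float)

# Path graph 0 - 1 - 2 - 3 - 4: the number of non-zero entries grows with the hop count,
# which is what drives the cost of the bilinear term on higher-hop neighborhoods.
A = np.zeros((5, 5))
for u in range(4):
    A[u, u + 1] = A[u + 1, u] = 1.0

nnz_per_hop = [int(k_hop_adjacency(A, k).sum()) for k in (1, 2, 3)]
```

On dense or high-degree graphs this count can grow quickly with the hop count, which is why neighbor sampling is suggested above.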
BGNN is a differentiable model, so it can be optimized end-to-end on any differentiable loss with gradient descent. In this work, we focus on the semi-supervised node classification task, optimizing BGNN with the cross-entropy loss on labeled nodes (the same setting as the GCN work Kipf and Welling (2017), for a fair comparison). As the experimental data is not large, we implement the layer-wise graph convolution in its matrix form, leaving the batch implementation and neighbor sampling, which can scale to large graphs, as future work.
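In the semi-supervised setting, the loss is computed only on the labeled nodes even though the forward pass covers the whole graph. A minimal NumPy sketch of such a masked cross-entropy follows; the function name and the boolean-mask interface are assumptions for illustration.

```python
import numpy as np

def masked_cross_entropy(logits, labels, train_mask):
    """Mean cross-entropy over the labeled (masked-in) nodes only.

    logits: (N, C) class scores, labels: (N,) integer classes, train_mask: (N,) booleans.
    """
    z = logits - logits.max(axis=1, keepdims=True)           # stabilize the softmax
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    picked = log_probs[np.arange(len(labels)), labels]       # log-probability of the true class
    return -picked[train_mask].mean()

# Uniform scores over two classes give a loss of log(2) regardless of the labels.
logits = np.zeros((3, 2))
labels = np.array([0, 1, 0])
train_mask = np.array([True, False, True])
loss = masked_cross_entropy(logits, labels, train_mask)
```

Unlabeled nodes still influence the loss indirectly, because their features flow into the representations of labeled nodes through the aggregation.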
| BGAT-T | 79.6 ± 0.6 | 71.4 ± 1.3 | - | 79.8 ± 0.3 | 84.2 ± 0.4 | 74.0 ± 0.3 | - |
The performance of SemiEmb, DeepWalk, Planetoid and GCN (2-layer) is directly copied from Kipf and Welling (2017), and the performance of GAT (2-layer) is copied from Veličković et al. (2018), since we follow the same data split and parameter settings. For the other methods, we report the mean and standard deviation of 10 different runs. RI means the average relative improvement achieved by BGAT-T. We omit the models with more layers in consideration of the over-smoothing issue Li et al. (2018).
In these datasets (Pubmed, Cora, and Citeseer), nodes and edges represent documents and citation relations between documents, respectively. Each node is represented by bag-of-words features extracted from the content of the document. Each node has a label that is a one-hot encoding of the document category. We employ the same data split as in previous works Kipf and Welling (2017); Yang et al. (2016); Veličković et al. (2018). That is, 20 labeled nodes per class are used for training, and 500 nodes and 1000 nodes are used as the validation set and test set, respectively. Note that the training process can use the features of all nodes. For this data split, we report the average test accuracy over ten different random initializations. To save space, we refer readers to Kipf and Welling (2017) for the detailed statistics of the three datasets.
We compare against the strong baselines mainly in two categories: network embedding and GNN. We select three widely used network embedding approaches: graph regularization-based network embedding (SemiEmb) Weston et al. (2012) and skip-gram-based graph embedding (DeepWalk Perozzi et al. (2014) and Planetoid Yang et al. (2016)). For GNNs, we select GCN Kipf and Welling (2017), GAT Veličković et al. (2018) and Graph Isomorphism Network (GIN) Xu et al. (2019b).
We devise two BGNNs which implement the $\mathrm{AGG}$ function as GCN and GAT, respectively. For each BGNN, we compare two variants with different scopes of the bilinear interactions: 1) BGCN-A and BGAT-A, which consider all nodes within the $k$-hop neighbourhood, including the target node, in the bilinear interaction; and 2) BGCN-T and BGAT-T, which consider the interactions between the target node and the neighbor nodes within its $k$-hop neighbourhood.
We closely follow the GCN work Kipf and Welling (2017) to set the hyper-parameters of SemiEmb, DeepWalk, and Planetoid. We perform grid search to select the optimal values for the hyper-parameters of the remaining methods, including the dropout rate, the weight for the $l_2$-norm ($\lambda$), the $\beta$ that trades off the aggregated information from multi-hop nodes, and the $\alpha$ that balances the linear aggregator and the bilinear aggregator. The dropout rate, $\lambda$, $\beta$, and $\alpha$ are each selected from a predefined candidate set. All BGNN-based models are trained for 2,000 epochs with an early stopping strategy based on both the convergence behavior and the accuracy on the validation set.
4.1 Performance Comparison
Table 1 shows the performance of the compared methods on the three datasets w.r.t. prediction accuracy, on exactly the same data split as in Kipf and Welling (2017). From the table, we have the following observations:
In all cases, the proposed BGNN models achieve the best performance, with average improvements over the baselines larger than 1.5%. The results validate the effectiveness of BGNN, which is attributed to incorporating the pairwise interactions between the nodes in the local structure (i.e., the ego network of the target node) when performing graph convolution.
On average, BGAT (BGCN) outperforms vanilla GAT (GCN) by 1.5% (1.6%). These results further indicate the benefit of considering the interaction between neighbor nodes, which could augment the representation of a target node, facilitating its classification. Furthermore, the improvements of BGAT and BGCN in the 1-layer and 2-layer settings are close, which indicates that the interactions between both 1-hop neighbors and 2-hop neighbors are helpful for the representation of a target node.
BGNN models, which have different scopes of the bilinear interactions, achieve different performance across datasets. In most cases, BGAT-T (BGCN-T) achieves performance better than BGAT-A (BGCN-A), signifying the importance of interactions with the target node.
Among the baselines, GCN models perform better than embedding-based methods, indicating the effectiveness of graph convolution operation in learning node representations. GAT models perform better than GCN models. These results are consistent with findings in previous works Kipf and Welling (2017); Yang et al. (2016); Veličković et al. (2018).
| BGCN-T | 77.9 ± 1.1 | 80.3 ± 1.1 | 71.6 ± 1.1 |
| BGCN-T | 79.4 ± 0.1 | 82.0 ± 0.1 | 71.9 ± 0.0 |
We perform a paired t-test between the results of BGCN-T and GCN on each of Pubmed, Cora, and Citeseer.
As reported in Kipf and Welling (2017) (Table 2), the performance of GCN on random data splits is significantly worse than on the fixed data split. As such, following Wu et al. (2019), we also test the methods on 10 random splits of the training set while keeping the validation and test sets unchanged. Table 2 shows the performance of BGCN-T and GCN over random splits. To save space, we omit the results of BGCN-A and the BGAT-based models, which show similar trends. As can be seen, BGCN-T still outperforms GCN at a high significance level, which further validates the effectiveness of the proposed model. However, the performance of both BGCN-T and GCN suffers on random data splits as compared to the fixed data split. This result is consistent with previous work Wu et al. (2019) and is reasonable, since the hyper-parameters are tuned on the fixed data split.
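For readers unfamiliar with the paired t-test used for the per-split comparison, the statistic can be sketched with the standard library alone. This is a generic textbook formula, not necessarily the exact procedure the authors ran; the function name is illustrative, and the inputs in the usage line are arbitrary numbers rather than reported accuracies.

```python
import math

def paired_t_statistic(a, b):
    """t statistic and degrees of freedom for paired samples a and b.

    Requires at least two pairs and a non-zero variance of the differences.
    """
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)   # sample variance of the differences
    return mean / math.sqrt(var / n), n - 1

# Arbitrary illustrative inputs (not results from the paper).
t_stat, dof = paired_t_statistic([2.0, 4.0, 6.0], [1.0, 2.0, 3.0])
```

Pairing by split matters here because both models are evaluated on the same random splits, so split-to-split variation cancels in the differences.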
4.2 Study of BGNN
Impacts of Bilinear Aggregator.
As the bilinear aggregator (BA) is at the core of BGNN, we first investigate its impact on the performance by varying the value of $\alpha$. Note that a larger $\alpha$ means more contribution from the BA; BGNN downgrades to vanilla GNN with only the linear aggregator when $\alpha = 0$, and depends fully on the BA when $\alpha = 1$. Figures 3(a) and 3(b) show the performance of BGCN-T with 1 layer and 2 layers on the Cora and Citeseer datasets, respectively. To save space, we omit the results of the other BGCN-based and BGAT-based models and the results on Pubmed, which show similar trends. We have the following observations: 1) Under the two settings (1-layer and 2-layer), the performance of BGCN-T varies in a range from 67.5 to 82.1. This suggests that careful tuning of $\alpha$ is needed for our models to achieve the desired performance. 2) BGCN-T outperforms vanilla GCN in most cases. This again verifies that the BA is capable of capturing complex patterns of information propagation, which are hard for the linear aggregator to reveal on its own. 3) Surprisingly, the performance of BGCN-T with $\alpha = 1$ is much worse than the performance when $\alpha$ is set to the optimal value. One possible reason is that the BA mainly serves as a complementary component to the linear aggregator and can hardly achieve comparable performance when working alone.
Impacts of Multi-Hop Neighbors.
We also study the effects of $\beta$, in order to explore the trade-off between the aggregated information from different hops. Note that setting $\beta$ to 0 and 1 corresponds to modeling only one-hop and only two-hop neighbors, respectively. As Figure 3 shows, involving the pairwise interactions from one- and two-hop neighbors simultaneously achieves better performance. This again verifies the effectiveness of stacking more BAs.
4.3 In-Depth Analysis of Aggregators
We perform an in-depth analysis of different aggregators to clarify their working mechanisms with respect to two node characteristics: 1) Degree, which denotes the average number of (one- and two-hop) neighbors surrounding the target node; and 2) Ratio, for which we first count the number of (one- and two-hop) neighbors that have the same label as the target node, and then divide this number by the number of all one- and two-hop neighbors. We summarize our statistical results in Table 3, wherein the two symbols denote whether the target nodes are correctly classified or misclassified, respectively. Jointly analyzing the three categories corresponding to the three rows in Table 3, we have the following findings: 1) Focusing on the third category, which has the lowest degree, BGCN-T consistently outperforms GCN, suggesting that the bilinear aggregator is able to distill useful information from sparser neighborhoods. 2) Comparing the third category to the second one, we observe that BGCN-T is able to endow the predictor with a node denoising ability. That is, BGCN-T can effectively aggregate information from the neighbors with consistent labels, filtering out the useless information from irrelevant neighbors. 3) We also see the limitations of BGCN-T from the second category: the bilinear interaction might need more label-consistent neighbors (i.e., a larger ratio) when more neighbors are involved (i.e., a larger degree).
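The two node statistics can be computed from the adjacency matrix and the labels. The sketch below handles one target node; excluding the target itself from its own neighbor set and returning a ratio of 0 for a node with no neighbors are assumptions made for illustration, as is the function name.

```python
import numpy as np

def degree_and_ratio(A, labels, v):
    """Count one- and two-hop neighbors of v and the share that has the same label as v."""
    reach = (A[v] + A[v] @ A) > 0        # reachable in one or two steps
    reach[v] = False                     # assumption: the target is not its own neighbor
    neighbors = np.flatnonzero(reach)
    degree = len(neighbors)
    if degree == 0:
        return 0, 0.0                    # assumption: ratio is 0 when there are no neighbors
    ratio = float(np.mean(labels[neighbors] == labels[v]))
    return degree, ratio

# Path graph 0 - 1 - 2 - 3 with one node of a different class.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
labels = np.array([0, 0, 1, 0])
deg0, ratio0 = degree_and_ratio(A, labels, 0)   # neighbors {1, 2}
deg1, ratio1 = degree_and_ratio(A, labels, 1)   # neighbors {0, 2, 3}
```

Averaging these per-node values within each group of correctly classified or misclassified nodes would give group-level statistics of the kind summarized in Table 3.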
In this paper, we proposed BGNN, a new graph neural network framework, which augments the expressiveness of vanilla GNN by considering the interactions between neighbor nodes. The neighbor node interactions are captured by a simple but carefully devised bilinear aggregator. The simplicity of the bilinear aggregator gives BGNN the same model complexity as vanilla GNN w.r.t. the number of learnable parameters and the analytical time complexity. Furthermore, the bilinear aggregator is proved to be permutation invariant, which is an important property for GNN aggregators Hamilton et al. (2017); Xu et al. (2019b). We applied the proposed BGNN to the semi-supervised node classification task, achieving state-of-the-art performance on three benchmark datasets.
In the future, we plan to explore the following research directions: 1) encoding high-order interactions among multiple neighbors, 2) exploring the effectiveness of deeper BGNNs with more than two layers, and 3) developing AutoML techniques Feurer et al. (2015) to adaptively learn the optimal $\alpha$ and $\beta$ for each neighbor.
- Atwood and Towsley (2016) Diffusion-convolutional neural networks. In NeurIPS, pp. 1993–2001.
- Beutel et al. (2018) Latent cross: making use of context in recurrent recommender systems. In WSDM, pp. 46–54.
- Bronstein et al. (2017) Geometric deep learning: going beyond euclidean data. IEEE Signal Processing Mag 34 (4), pp. 18–42.
- Bruna et al. (2014) Spectral networks and locally connected networks on graphs. In ICLR.
- Chami et al. (2019) Hyperbolic graph convolutional neural networks. In NeurIPS, pp. 4869–4880.
- Chen et al. (2018) Fastgcn: fast learning with graph convolutional networks via importance sampling. In ICLR.
- Defferrard et al. (2016) Convolutional neural networks on graphs with fast localized spectral filtering. In NeurIPS, pp. 3844–3852.
- Feurer et al. (2015) Efficient and robust automated machine learning. In NeurIPS, pp. 2962–2970.
- Hamilton et al. (2017) Inductive representation learning on large graphs. In NeurIPS, pp. 1024–1034.
- He et al. (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778.
- He and Chua (2017) Neural factorization machines for sparse predictive analytics. In SIGIR, pp. 355–364.
- Kampffmeyer et al. (2019) Rethinking knowledge graph propagation for zero-shot learning. In CVPR, pp. 11487–11496.
- Kipf and Welling (2017) Semi-supervised classification with graph convolutional networks. In ICLR.
- Li et al. (2018) Deeper insights into graph convolutional networks for semi-supervised learning. In AAAI, pp. 3538–3545.
- Liao et al. (2019) LanczosNet: multi-scale deep graph convolutional networks. In ICLR.
- Park and Neville (2019) Exploiting interaction links for node classification with deep graph neural networks. In IJCAI, pp. 3223–3230.
- Perozzi et al. (2014) Deepwalk: online learning of social representations. In KDD, pp. 701–710.
- Rendle (2010) Factorization machines. In ICDM, pp. 995–1000.
- Sen et al. (2008) Collective classification in network data. AI magazine 29 (3), pp. 93–93.
- Veličković et al. (2018) Graph attention networks. In ICLR.
- Veličković et al. (2019) Deep graph infomax. In ICLR.
- Wang et al. (2019) Neural graph collaborative filtering. In SIGIR, pp. 165–174.
- Weston et al. (2012) Deep learning via semi-supervised embedding. In Neural Networks: Tricks of the Trade, pp. 639–655.
- Wu et al. (2019) Simplifying graph convolutional networks. In ICML, pp. 6861–6871.
- Xinyi and Chen (2019) Capsule graph neural network. In ICLR.
- Xu et al. (2019a) Graph wavelet neural network. In ICLR.
- Xu et al. (2019b) How powerful are graph neural networks?. In ICLR.
- Xu et al. (2018) Representation learning on graphs with jumping knowledge networks. In ICML, pp. 8676–8685.
- Yang et al. (2016) Revisiting semi-supervised learning with graph embeddings. In ICML, pp. 86–94.
- Yao et al. (2019) Graph convolutional networks for text classification. In AAAI, pp. 7370–7377.
- Zhang et al. (2018) Deep learning on graphs: a survey. arXiv preprint arXiv:1812.04202.