Node-wise Localization of Graph Neural Networks

Graph neural networks (GNNs) emerge as a powerful family of representation learning models on graphs. To derive node representations, they utilize a global model that recursively aggregates information from the neighboring nodes. However, different nodes reside at different parts of the graph in different local contexts, making their distributions vary across the graph. Ideally, how a node receives its neighborhood information should be a function of its local context, to diverge from the global GNN model shared by all nodes. To utilize node locality without overfitting, we propose a node-wise localization of GNNs by accounting for both global and local aspects of the graph. Globally, all nodes on the graph depend on an underlying global GNN to encode the general patterns across the graph; locally, each node is localized into a unique model as a function of the global model and its local context. Finally, we conduct extensive experiments on four benchmark graphs, and consistently obtain promising performance surpassing the state-of-the-art GNNs.



1 Introduction

Graphs are powerful data structures to model various entities (i.e., nodes) and their interactions (i.e., edges) simultaneously. To learn the representations of nodes on a graph, graph neural networks (GNNs) [Wu et al.2020] have been proposed as a promising solution. Generally, state-of-the-art GNNs [Kipf and Welling2017, Hamilton et al.2017, Veličković et al.2018] learn the representation of each node by recursively transferring and aggregating information from its receptive field, which is often defined as its set of neighboring nodes. Consider a toy citation graph in Fig. 1(a), consisting of papers in three areas: biology (bio), bioinformatics (bioinf) and computer science (cs). To derive the representation of a given node, GNNs aggregate keyword features from not only the node itself but also its neighbors, as illustrated in Fig. 1(b). Such neighborhood aggregation can be performed recursively for the neighbors as well in more layers, to fully exploit the graph structures. To extract useful representations from the neighbors, GNNs aim to learn the model parameters, formulated as a sequence of weight matrices with one matrix in each layer.

Figure 1: Comparison of conventional GNNs and the proposed localization (best viewed in color).

However, the implicit assumption of a global weight matrix (in each layer) for the entire graph is often too strict. Nodes do not distribute uniformly over the graph, and are associated with different local contexts. For instance, in Fig. 1(c), three different nodes are associated with a bio context, a bioinf context and a cs context, respectively. Different local contexts are characterized by different keyword features, such as “gene” and “cell” in bio, “gene” and “svm” in bioinf, as well as “svm” and “vc-dim” in cs. Thus, a graph-level global weight matrix is inherently inadequate to express the varying importance of features at different localities. More specifically, a global weight matrix can be overly diffuse, because different nodes often have distinct optimal weight matrices, which tend to pull the model in many opposing directions. This may result in a biased model that centers its mass around the most frequent patterns while leaving others not well covered. A natural question follows: Can we allow each node to be parameterized by its own weight matrix? Unfortunately, this is likely to cause severe overfitting to local noise and to incur a significantly higher training cost.

In this work, to adapt to the local context of each node without overfitting, we localize the model for each node from a shared global model. As illustrated in Fig. 1(c), a localized weight matrix for each node can be derived from both the global weight and the local context of that node. That is, the localized weight is a function of the global information and the local context. Globally, all nodes depend on a common underlying model to encode the general patterns at the graph level. Locally, each node leverages its local context to personalize the common model into its unique localized weight at the node level. It is also useful to incorporate a finer-grained localization at the edge level, where information received through each edge of the target node can be further adjusted. The proposed localization, which we call node-wise localized GNN (LGNN), aims to strike a balance between the global and local aspects of the graph. Moreover, LGNN is agnostic of the underlying global model, meaning that it is able to localize any GNN that follows the paradigm of recursive neighborhood aggregation, and subsumes various conventional GNNs as its limiting cases.

In summary, our main contributions are three-fold. (1) We identify the need to adapt to the local context of each node in GNNs. (2) We propose a node-wise localization of GNNs to capture both global and local information on the graph, and further discuss its connection to other works. (3) We conduct extensive experiments on four benchmark datasets and show that LGNN consistently outperforms prior art.

2 Proposed Approach

In this section, we introduce the proposed approach LGNN, and discuss its connection to other works.

2.1 General Formulation of Localization

We start with an abstract definition of localizing a global model to fit local contexts. Specifically, the localized model of an instance is a function of both the global model and the local context. Let $\Theta$ denote the global model, and $\mathcal{C}_v$ the local context of an instance $v$. Then the localized model for $v$ is

$$\Theta_v = \Gamma(\Theta, \mathcal{C}_v), \tag{1}$$

where $\Gamma$ can be any function such as a neural network. While we focus on graph data where $v$ is a node, the general formulation is also pertinent to other kinds of data when there are varying, non-uniform contexts associated with instances. In particular, on a graph $G = (V, E)$ with a set of nodes $V$ and edges $E$, the local context of a node $v$ can be materialized as the neighbors of $v$, in addition to $v$ itself. That is,

$$\mathcal{C}_v = \{v\} \cup \{u : \langle v, u \rangle \in E\}. \tag{2}$$
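The materialization of the local context as a node together with its neighbors can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `local_contexts`, the edge-list input format, and the treatment of edges as undirected are all assumptions made here.

```python
def local_contexts(num_nodes, edges):
    """Map each node v to its local context: v itself plus its neighbors.

    Assumption: the graph is undirected and given as (u, v) index pairs.
    """
    contexts = {v: {v} for v in range(num_nodes)}  # every context contains the node itself
    for u, v in edges:
        contexts[u].add(v)
        contexts[v].add(u)
    return contexts


# Hypothetical toy graph: a path 0 - 1 - 2 plus an isolated node 3.
toy_contexts = local_contexts(4, [(0, 1), (1, 2)])
print(toy_contexts[1])  # {0, 1, 2}
print(toy_contexts[3])  # {3}
```

Because every context contains the node itself, an isolated node still has a non-empty context, so any pooling over the context stays well defined.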


2.2 Localization of GNNs

A typical GNN consists of multiple layers of recursive neighborhood aggregation. In the $l$-th layer, each node $v$ receives and aggregates information from its neighbors to derive its next hidden representation $\mathbf{h}_v^{l+1}$. Here $\mathbf{h}_v^{l} \in \mathbb{R}^{d_l}$ denotes the $l$-th layer representation of $v$, with $d_l$ being the dimension of the $l$-th layer representation. Note that the initial representation $\mathbf{h}_v^{0}$ is simply the input feature vector of $v$. The aggregation is performed on the local context $\mathcal{C}_v$ of $v$, which also covers the neighbors of $v$, as follows.

$$\mathbf{h}_v^{l+1} = \sigma\left(\mathbf{W}^l \, \textsc{Aggr}\left(\{\mathbf{h}_i^l : i \in \mathcal{C}_v\}\right)\right), \tag{3}$$

where $\mathbf{W}^l \in \mathbb{R}^{d_{l+1} \times d_l}$ is a weight matrix in the $l$-th layer, $\sigma$ is an activation function, and $\textsc{Aggr}$ is an aggregation function. Different GNNs differ in the choice of the aggregation function. For instance, GCN [Kipf and Welling2017] uses an aggregator roughly equivalent to mean pooling [Hamilton et al.2017], GAT [Veličković et al.2018] uses an attention-weighted mean aggregator, and GIN [Xu et al.2019] uses a multi-layer perceptron (MLP).
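To make the recursive neighborhood aggregation concrete, the following NumPy sketch runs one layer of a global GNN with a mean aggregator. It is an illustration under stated assumptions: the names `global_gnn_layer`, `H0`, `W0` and `ctx`, the `tanh` activation, and the dictionary representation of local contexts are all choices made here, not details taken from the paper.

```python
import numpy as np


def global_gnn_layer(H, contexts, W, activation=np.tanh):
    """One layer with a single weight matrix W shared by all nodes.

    H: (num_nodes, d_in) current representations.
    W: (d_out, d_in) global weight matrix of this layer.
    contexts: dict mapping every node index to its local context (incl. itself).
    """
    H_next = np.empty((H.shape[0], W.shape[0]))
    for v, ctx_v in contexts.items():
        pooled = H[sorted(ctx_v)].mean(axis=0)  # mean aggregator over the local context
        H_next[v] = activation(W @ pooled)      # the same W is applied to every node
    return H_next


rng = np.random.default_rng(0)
H0 = rng.normal(size=(3, 4))                # 3 nodes with 4 input features each
W0 = rng.normal(size=(2, 4))                # global weight: 4 -> 2 dimensions
ctx = {0: {0, 1}, 1: {0, 1, 2}, 2: {1, 2}}  # path graph 0 - 1 - 2
H1 = global_gnn_layer(H0, ctx, W0)
print(H1.shape)  # (3, 2)
```

Swapping the mean for another pooling function changes the aggregator while leaving the globally shared weight untouched, which is the part the localization below modifies.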

Node-level localization.

In the above setup, we have a global model $\mathbf{W}^l$ for all nodes in the $l$-th layer. At the node level, the global model is localized to adapt to the local context of each node. Specifically, we modulate the global weight by a node-specific transformation through scaling and shifting. The localized weight matrix of node $v$ is given by

$$\mathbf{W}_v^l = \mathbf{W}^l \odot [\mathbf{a}_v^l]_{\times d_{l+1}}^{\top} + [\mathbf{b}_v^l]_{\times d_{l+1}}^{\top}, \tag{4}$$

where $\mathbf{a}_v^l, \mathbf{b}_v^l \in \mathbb{R}^{d_l}$ are $v$-specific vectors for scaling and shifting, respectively. Here the notation $[\mathbf{x}]_{\times n}$ represents a matrix of $n$ columns all of which are identical to the vector $\mathbf{x}$, and $\odot$ denotes element-wise multiplication. We essentially transform each row of the global matrix $\mathbf{W}^l$ by $\mathbf{a}_v^l$ and $\mathbf{b}_v^l$ in an element-wise manner, to generate the localized weight $\mathbf{W}_v^l$, so that various importance levels are assigned to each feature dimension of the node embedding $\mathbf{h}_v^l$.

It is important to recognize that the node-specific transformation should not be directly learned, for two reasons. First, directly learning it significantly increases the number of model parameters, especially in the presence of a large number of nodes, causing overfitting. Second, node-wise localization is a function of both the global model and the local context as formulated in Eq. (1), in order to better capture the local information surrounding each node. Thus, for node $v$, we propose to generate the $v$-specific transformation in the $l$-th layer from its local contextual information $\mathbf{c}_v^l$. A straightforward recipe for $\mathbf{c}_v^l$ is to pool the $l$-th layer representations of the nodes in the local context $\mathcal{C}_v$:

$$\mathbf{c}_v^l = \textsc{Mean}\left(\{\mathbf{h}_i^l : i \in \mathcal{C}_v\}\right), \tag{5}$$

in which we adopt the mean pooling although other forms of pooling can be substituted for it. Given the local contextual information $\mathbf{c}_v^l$, we further utilize a dense layer, shared by all nodes, to generate the $v$-specific transformation as follows.

$$\mathbf{a}_v^l = \sigma\left(\mathbf{M}_a^l \mathbf{c}_v^l\right) + \mathbf{1}, \qquad \mathbf{b}_v^l = \sigma\left(\mathbf{M}_b^l \mathbf{c}_v^l\right), \tag{6}$$

where $\mathbf{M}_a^l, \mathbf{M}_b^l \in \mathbb{R}^{d_l \times d_l}$ are learnable parameters shared by all nodes, and $\mathbf{1}$ is a vector of ones to ensure that the scaling factors are centered around one. Note that if the dimension $d_l$ is too large, such as that of the high-dimensional raw features handled in the first layer of a GNN, we could employ two dense layers and set the first one with a smaller number of neurons.
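The node-level localization described above (pool the local context, pass it through a shared dense layer, then scale and shift the global weight) can be sketched as follows. This is a hedged illustration, not the authors' code: the names `node_localized_weights`, `M_a` and `M_b`, the leaky-ReLU activation, and the absence of bias terms in the dense layer are assumptions made for this example.

```python
import numpy as np


def leaky_relu(x, slope=0.2):
    # Assumed activation for the shared dense layer; the text does not fix one.
    return np.where(x > 0, x, slope * x)


def node_localized_weights(H, contexts, W, M_a, M_b):
    """Return one localized weight matrix per node, shape (num_nodes, d_out, d_in)."""
    W_local = np.empty((H.shape[0],) + W.shape)
    for v, ctx_v in contexts.items():
        c_v = H[sorted(ctx_v)].mean(axis=0)  # mean-pooled local context
        a_v = leaky_relu(M_a @ c_v) + 1.0    # scaling, centered around one
        b_v = leaky_relu(M_b @ c_v)          # shifting, centered around zero
        W_local[v] = W * a_v + b_v           # broadcasting transforms each row of W element-wise
    return W_local


rng = np.random.default_rng(0)
H = rng.normal(size=(3, 4))
W = rng.normal(size=(2, 4))
ctx = {0: {0, 1}, 1: {0, 1, 2}, 2: {1, 2}}
M_a = 0.1 * rng.normal(size=(4, 4))
M_b = 0.1 * rng.normal(size=(4, 4))

W_local = node_localized_weights(H, ctx, W, M_a, M_b)
# With all-zero dense weights, scaling is exactly one and shifting exactly zero,
# so every node falls back to the global weight matrix.
W_limit = node_localized_weights(H, ctx, W, np.zeros((4, 4)), np.zeros((4, 4)))
print(W_local.shape)             # (3, 2, 4)
print(np.allclose(W_limit, W))   # True
```

Only `M_a` and `M_b` are learnable here, so the number of extra parameters is independent of the number of nodes, which is the point of generating the transformation instead of learning it per node.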

Edge-level localization.

The node-level localization of the global weight matrix is coarse-grained, as the same localized weight is applied on all neighbors of the target node. That is, the target node receives information through each of its edges uniformly. To enable a finer-grained localization, we further modulate how information propagates at the edge level.

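Consider a node $v$ with edges $\{\langle v, i \rangle : i \in \mathcal{C}_v\}$, where $\mathcal{C}_v$ is the local context of $v$. In the $l$-th layer of the GNN, let $\mathbf{c}_{vi}^l$ denote the local contextual information of the edge $\langle v, i \rangle$, which is given by its two ends. Specifically, we concatenate the $l$-th layer representations of nodes $v$ and $i$, as follows.

$$\mathbf{c}_{vi}^l = \mathbf{h}_v^l \,\Vert\, \mathbf{h}_i^l. \tag{7}$$

To implement the edge-level localization, we similarly adopt an edge-specific transformation through scaling and shifting. Specifically, to derive the $(l+1)$-th layer representation of $v$, we scale and shift the information from each edge during aggregation, by rewriting Eq. (3) as follows.

$$\mathbf{h}_v^{l+1} = \sigma\left(\mathbf{W}_v^l \, \textsc{Aggr}\left(\{\mathbf{h}_i^l \odot \mathbf{a}_{vi}^l + \mathbf{b}_{vi}^l : i \in \mathcal{C}_v\}\right)\right), \tag{8}$$

where $\mathbf{W}_v^l$ is the node-level localized weight of $v$, and $\mathbf{a}_{vi}^l, \mathbf{b}_{vi}^l \in \mathbb{R}^{d_l}$ are edge $\langle v, i \rangle$-specific vectors for scaling and shifting, respectively. Like the node-specific transformation, we generate the edge-specific transformation with a dense layer given by

$$\mathbf{a}_{vi}^l = \sigma\left(\tilde{\mathbf{M}}_a^l \mathbf{c}_{vi}^l\right) + \mathbf{1}, \qquad \mathbf{b}_{vi}^l = \sigma\left(\tilde{\mathbf{M}}_b^l \mathbf{c}_{vi}^l\right), \tag{9}$$

where $\tilde{\mathbf{M}}_a^l, \tilde{\mathbf{M}}_b^l \in \mathbb{R}^{d_l \times 2d_l}$ are learnable parameters shared by all edges.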
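The edge-level step can be sketched in the same style: each message is scaled and shifted by vectors generated from the concatenated representations of the edge's two ends, and only then aggregated. The sketch below is an assumption-laden illustration rather than the paper's implementation: it uses a mean aggregator, a leaky-ReLU dense layer without bias, treats the target node as its own neighbor, and the names `edge_localized_layer`, `M_a`, `M_b` and `W_stack` are hypothetical.

```python
import numpy as np


def leaky_relu(x, slope=0.2):
    # Assumed activation for the shared dense layer.
    return np.where(x > 0, x, slope * x)


def edge_localized_layer(H, contexts, W_local, M_a, M_b, activation=np.tanh):
    """W_local: (num_nodes, d_out, d_in) per-node weights; M_a, M_b: (d_in, 2 * d_in)."""
    H_next = np.empty((H.shape[0], W_local.shape[1]))
    for v, ctx_v in contexts.items():
        messages = []
        for i in sorted(ctx_v):
            c_vi = np.concatenate([H[v], H[i]])  # edge context: both ends of the edge
            a_vi = leaky_relu(M_a @ c_vi) + 1.0  # edge-specific scaling
            b_vi = leaky_relu(M_b @ c_vi)        # edge-specific shifting
            messages.append(H[i] * a_vi + b_vi)  # modulate what v receives through this edge
        H_next[v] = activation(W_local[v] @ np.mean(messages, axis=0))
    return H_next


rng = np.random.default_rng(0)
H = rng.normal(size=(3, 4))
W = rng.normal(size=(2, 4))
ctx = {0: {0, 1}, 1: {0, 1, 2}, 2: {1, 2}}
W_stack = np.broadcast_to(W, (3,) + W.shape)  # node-level localization switched off
M_a = 0.1 * rng.normal(size=(4, 8))
M_b = 0.1 * rng.normal(size=(4, 8))
H_next = edge_localized_layer(H, ctx, W_stack, M_a, M_b)
print(H_next.shape)  # (3, 2)
```

Passing per-node matrices in `W_local` instead of the broadcast global weight combines both levels of localization in one layer.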

2.3 Semi-supervised Node Classification

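In the benchmark task of semi-supervised node classification, each node belongs to one of the pre-defined classes $\{1, 2, \ldots, K\}$. However, we only know the class labels of a subset of nodes $V_{\mathrm{lab}} \subset V$, called the labeled nodes. The goal is to predict the class labels of the remaining nodes. Following standard practice [Kipf and Welling2017], for $K$-way classification we set the dimension of the output layer to $K$, and apply a softmax function to the representation of each node. That is, given a total of $\ell$ layers,

$$\mathbf{z}_v = \textsc{Softmax}\left(\mathbf{h}_v^{\ell}\right). \tag{10}$$

For a labeled node $v$, let $y_{v,k}$ be 1 if and only if node $v$ belongs to class $k$. The overall loss is then formulated using a cross-entropy loss with regularization as

$$\mathcal{L} = -\sum_{v \in V_{\mathrm{lab}}} \sum_{k=1}^{K} y_{v,k} \ln z_{v,k} + \lambda_G \|\Theta_G\|_2^2 + \lambda_L \|\Theta_L\|_2^2 + \lambda \left( \frac{\|\mathcal{A} - 1\|_2^2}{|\mathcal{A}|} + \frac{\|\mathcal{B}\|_2^2}{|\mathcal{B}|} \right), \tag{11}$$

where $\Theta_G$ contains the parameters of the global model, $\Theta_L$ contains the parameters of localization, and $\mathcal{A}$, $\mathcal{B}$ are sets respectively containing the transformation vectors for scaling (the node-specific $\mathbf{a}_v^l$'s and the edge-specific $\mathbf{a}_{vi}^l$'s) and shifting (the $\mathbf{b}_v^l$'s and $\mathbf{b}_{vi}^l$'s). Note that the norms $\|\cdot\|_2$ are computed over all tensor elements in the sets. While $\mathcal{A}$ and $\mathcal{B}$ are not learnable, they are functions of the learnable $\Theta_L$. We explicitly constrain them to favor small local changes, i.e., close-to-one scaling and close-to-zero shifting. Moreover, as $\mathcal{A}$ and $\mathcal{B}$ grow with the size of the graph, their norms are further normalized by the total number of tensor elements in them, denoted by $|\mathcal{A}|$ and $|\mathcal{B}|$, respectively. $\lambda_G$, $\lambda_L$ and $\lambda$ are non-negative hyperparameters.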
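The structure of this objective (cross-entropy on labeled nodes, weight regularization, and a normalized penalty pulling scaling toward one and shifting toward zero) can be sketched numerically. This is a sketch under assumptions, not the authors' training code: squared L2 norms are assumed for all regularizers, and the names `lgnn_loss`, `theta_g`, `theta_l`, `lam_g`, `lam_l` and `lam` are hypothetical.

```python
import numpy as np


def softmax(x):
    x = x - x.max(axis=1, keepdims=True)  # subtract the row maximum for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=1, keepdims=True)


def lgnn_loss(H_out, Y, labeled, theta_g, theta_l, A, B, lam_g, lam_l, lam):
    """H_out: (n, K) output-layer representations; Y: (n, K) one-hot labels.

    theta_g / theta_l: lists of global / localization parameter arrays.
    A / B: lists of generated scaling / shifting vectors (not learnable themselves).
    """
    Z = softmax(H_out)
    cross_entropy = -np.sum(Y[labeled] * np.log(Z[labeled]))  # labeled nodes only
    reg_global = lam_g * sum(np.sum(p ** 2) for p in theta_g)
    reg_local = lam_l * sum(np.sum(p ** 2) for p in theta_l)
    # Penalties are normalized by the total number of tensor elements in A and B.
    scale_pen = sum(np.sum((a - 1.0) ** 2) for a in A) / sum(a.size for a in A)
    shift_pen = sum(np.sum(b ** 2) for b in B) / sum(b.size for b in B)
    return cross_entropy + reg_global + reg_local + lam * (scale_pen + shift_pen)


# Tiny hand-checkable example: uniform predictions over two classes, one labeled node.
loss = lgnn_loss(
    H_out=np.zeros((2, 2)),
    Y=np.array([[1.0, 0.0], [0.0, 1.0]]),
    labeled=[0],
    theta_g=[np.ones((2, 2))],                         # squared norm 4, weighted by 0.5
    theta_l=[np.zeros((2, 2))],
    A=[np.array([1.0, 1.0]), np.array([2.0, 1.0])],    # one entry deviates from one by 1
    B=[np.zeros(2)],
    lam_g=0.5, lam_l=0.5, lam=4.0,
)
print(np.isclose(loss, np.log(2) + 2.0 + 1.0))  # True
```

In the example the cross-entropy is ln 2, the global regularizer contributes 0.5 × 4 = 2, and the scaling penalty contributes 4 × (1/4) = 1.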

2.4 Connections to Other Works

Next, we discuss how LGNN is related to other lines of work in GNNs, network embedding and hypernetworks.

Relationship with existing GNNs.

The proposed LGNN is able to localize any GNN model that follows the scheme of recursive neighborhood aggregation in Eq. (3). In particular, LGNN is a generalized GNN that subsumes, as special limiting cases, several conventional GNNs by adopting an appropriate aggregation function and regularization.

Specifically, as the hyperparameter $\lambda$ that weights the scaling and shifting regularizers in the loss of Sect. 2.3 tends to infinity, the scaling and shifting would approach 1’s and 0’s, respectively. This is equivalent to removing all node-wise localization at both node and edge levels, where all nodes assume a global model. Thus, LGNN would be asymptotically equivalent to GCN [Kipf and Welling2017] and GIN [Xu et al.2019], when using a graph convolution aggregator (roughly equivalent to the mean pooling) and an MLP aggregator on the sum-pooled representations, respectively. Furthermore, with an appropriate regularization, LGNN can be asymptotically reduced to GAT [Veličković et al.2018] (here we only discuss GAT with one attention head). Consider the set $\mathcal{A}$ of vectors for scaling, which can be split into the node-specific $\mathcal{A}_n$ (containing the $\mathbf{a}_v^l$’s) and the edge-specific $\mathcal{A}_e$ (containing the $\mathbf{a}_{vi}^l$’s). For $\mathcal{A}_n$, we maintain the same regularization $\|\mathcal{A}_n - 1\|_2^2$; for $\mathcal{A}_e$, we adopt the regularization such that $\mathcal{A}_e$ contains the corresponding attention coefficients on all edges. That is,

$$\sum_{\mathbf{a}_{vi}^l \in \mathcal{A}_e} \left\| \mathbf{a}_{vi}^l - \alpha_{vi}^l \mathbf{1} \right\|_2^2, \tag{12}$$

where $\alpha_{vi}^l$ is the attention coefficient on edge $\langle v, i \rangle$. Coupled with a mean aggregator, we obtain GAT as the limiting case when $\lambda \to \infty$. Alternatively, we can also maintain the same regularization for both node- and edge-specific transformations, and adopt a weighted mean aggregator in accordance with the attention coefficients. Note that in GAT the attention coefficient on each edge is a scalar, whereas in our edge-level localization, the edge-specific scaling is represented by a vector, which is able to flexibly vary the contribution from different dimensions of the information received through the edges.

Relationship with network embedding.

Network embedding [Cai et al.2018] is a popular alternative to GNNs to learn node representations on a graph. In these methods, each node is directly encoded with a learnable parameter vector, which is taken as the representation of the node. In contrast, in GNNs, the representations of the nodes are derived from a shared learnable model. Thus, network embedding can be viewed as a set of individual local models that are loosely coupled by local graph structures, whereas GNNs capture a more global view of the structures. The proposed LGNN achieves a balance between the local and global views, allowing localized variations grounded on a global model.

Relationship with hypernetworks.

Our localization strategy can be deemed a form of hypernetworks [Ha et al.2017], which use a secondary neural network to generate weights for the target network. In our case, we employ dense layers as a secondary network to generate the vectors for scaling and shifting, which are leveraged to ultimately generate the localized weights for the target GNN. Moreover, our approach boils down to the feature-wise modulation of information received from the conditioning local context (i.e., neighboring nodes) at both node and edge levels, which is inspired by the feature-wise modulation of neural activations in FiLMs [Perez et al.2018, Birnbaum et al.2019].

On graphs, GNN-FiLM [Brockschmidt2020] also uses a form of FiLM on GNNs. However, there are several fundamental differences between GNN-FiLM and our LGNN. First, in terms of problem and motivation, GNN-FiLM aims to model edge labels on relational networks for message passing. In contrast, our LGNN aims to model local contexts for node-wise localization, and works on general graphs. Second, in terms of model, GNN-FiLM models relation-specific transformations conditioned only on a node’s self-information, which does not sufficiently reflect the full local context of the node on which LGNN is conditioned. Furthermore, when there are no edge labels, GNN-FiLM reduces to a “uniform edge” model, whereas LGNN still has both node- and edge-level modulation to achieve localization. Third, in terms of empirical performance, as discussed therein [Brockschmidt2020], GNN-FiLM only achieves at best comparable performance to existing GNNs on citation networks (such as Cora and Citeseer) where there are no edge labels. Our own experiments in the next section reproduce similar results.

3 Experiments

In this section, we evaluate and analyze the empirical performance of our proposed approach LGNN.

3.1 Experimental Setup

Additional implementation details and experimental settings are included in Sections A and B of the supplemental material.


We utilize four benchmark datasets. They include two academic citation networks, namely Cora and Citeseer [Yang et al.2016], in which nodes correspond to papers and edges correspond to citations between papers. The input node features are bag-of-word vectors, indicating the presence of each keyword. A similar citation network for Wikipedia articles called Chameleon [Rozemberczki et al.2021] is also used. Finally, we use an e-commerce co-purchasing network called Amazon [Hou et al.2020], in which the nodes correspond to computer products and the edges correspond to co-purchasing relationships between products. The input node feature vectors are constructed from product images. The statistics of the datasets are summarized in Table 1.

| Dataset | # Nodes | # Edges | # Classes | # Features |
|---|---|---|---|---|
| Cora | 2,708 | 5,429 | 7 | 1,433 |
| Citeseer | 3,327 | 4,732 | 6 | 3,703 |
| Amazon | 13,381 | 245,778 | 10 | 767 |
| Chameleon | 2,277 | 36,101 | 5 | 2,325 |

Table 1: Summary of datasets.

| Methods | # Params (Cora) | Cora Accuracy | Cora Micro-F | Citeseer Accuracy | Citeseer Micro-F | Amazon Accuracy | Amazon Micro-F | Chameleon Accuracy | Chameleon Micro-F |
|---|---|---|---|---|---|---|---|---|---|
| DeepWalk | 693K | 73.8±0.3 | 74.9±0.1 | 61.6±0.2 | 60.5±1.0 | 80.1±1.6 | 77.3±1.3 | 41.2±1.3 | 40.1±1.1 |
| Planetoid | 345K | 66.1±0.4 | 64.5±0.5 | 64.5±0.3 | 62.9±0.4 | 69.8±1.7 | 64.5±1.5 | 39.3±1.8 | 37.7±1.7 |
| GCN | 11K | 81.5±0.7 | 80.8±0.5 | 70.4±0.5 | 68.3±0.7 | 81.9±0.5 | 81.0±0.8 | 46.7±4.3 | 46.4±2.4 |
| GCN-64 | 92K | 82.0±0.3 | 80.9±0.3 | 71.1±0.3 | 69.2±0.4 | 82.1±0.5 | 81.2±0.8 | 48.3±3.3 | 46.3±1.8 |
| GCN-96 | 138K | 81.9±0.2 | 80.8±0.3 | 71.3±0.4 | 69.4±0.5 | 82.2±0.4 | 81.5±0.7 | 45.5±2.4 | 43.8±2.5 |
| GCN-FiLM | 35K | 78.1±0.6 | 76.9±0.5 | 69.8±1.1 | 67.9±1.0 | 79.2±1.0 | 77.1±1.5 | 42.8±1.1 | 39.9±1.3 |
| LGCN | 104K | 83.5±0.3 | 82.1±0.4 | 72.2±0.4 | 70.2±0.4 | 83.7±1.5 | 82.3±2.0 | 50.9±1.1 | 49.7±0.7 |
| (improv.) | - | (1.8%) | (1.5%) | (1.3%) | (1.2%) | (1.8%) | (1.0%) | (5.4%) | (7.1%) |
| GAT | 92K | 82.9±0.6 | 82.0±0.6 | 72.4±0.7 | 70.4±0.8 | 82.4±1.3 | 80.1±1.9 | 47.2±1.1 | 46.2±2.1 |
| GAT-64 | 738K | 83.1±0.4 | 81.9±0.6 | 71.6±1.5 | 69.8±1.6 | 83.0±0.9 | 81.2±1.4 | 51.2±1.5 | 50.2±1.3 |
| GAT-96 | 1108K | 83.2±0.6 | 81.9±0.6 | 71.4±0.9 | 69.6±0.9 | 83.1±1.0 | 81.5±1.4 | 51.9±1.2 | 50.2±1.8 |
| GAT-FiLM | 277K | 82.0±0.5 | 80.6±0.6 | 71.2±1.0 | 69.2±1.1 | 83.3±0.6 | 81.9±0.8 | 46.8±5.7 | 45.1±5.2 |
| LGAT | 836K | 83.6±0.4 | 82.3±0.4 | 72.8±0.4 | 70.8±0.5 | 83.7±0.7 | 82.3±0.8 | 52.6±1.0 | 51.1±0.9 |
| (improv.) | - | (0.5%) | (0.4%) | (0.6%) | (0.6%) | (0.5%) | (0.5%) | (1.3%) | (1.8%) |
| GIN | 11K | 80.2±0.5 | 78.8±0.3 | 68.5±0.7 | 66.5±1.0 | 79.6±1.7 | 78.5±2.6 | 45.8±3.0 | 41.2±4.0 |
| GIN-64 | 92K | 80.3±1.1 | 79.1±1.0 | 67.8±1.5 | 66.1±1.1 | 79.8±1.1 | 79.0±1.4 | 45.7±4.5 | 40.7±5.7 |
| GIN-96 | 138K | 79.9±1.1 | 78.9±1.0 | 68.6±1.4 | 66.6±1.6 | 80.2±2.1 | 79.0±3.2 | 45.9±3.5 | 41.5±4.1 |
| GIN-FiLM | 35K | 79.8±0.7 | 78.5±0.5 | 67.7±1.4 | 65.8±1.5 | 78.6±2.8 | 77.2±3.3 | 38.8±2.6 | 34.2±2.9 |
| LGIN | 126K | 82.6±0.8 | 81.6±0.8 | 71.3±0.4 | 69.5±0.5 | 84.0±1.2 | 82.7±1.7 | 48.3±1.9 | 47.3±1.9 |
| (improv.) | - | (2.9%) | (3.2%) | (3.9%) | (4.4%) | (4.7%) | (4.7%) | (5.2%) | (14.0%) |

Table 2: Average classification performance with standard deviation (percent) over 10 runs. Improvements of LGNN are relative to the best performing baseline with the corresponding GNN architecture.
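The "(improv.)" rows follow from simple arithmetic on the mean scores. The sketch below recomputes three of them from the values in Table 2; the helper name `relative_improvement` is hypothetical, and the formula (relative gain over the best baseline of the same architecture, in percent) is inferred from the table caption.

```python
def relative_improvement(ours, best_baseline):
    """Relative gain of `ours` over `best_baseline`, in percent."""
    return 100.0 * (ours - best_baseline) / best_baseline


# LGCN vs. the best GCN-based baseline (GCN-64) on Cora accuracy.
print(round(relative_improvement(83.5, 82.0), 1))  # 1.8
# LGAT vs. the best GAT-based baseline (GAT-96) on Cora accuracy.
print(round(relative_improvement(83.6, 83.2), 1))  # 0.5
# LGIN vs. the best GIN-based baseline (GIN-96) on Chameleon Micro-F.
print(round(relative_improvement(47.3, 41.5), 1))  # 14.0
```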

Baselines and our approach.

We compare against three categories of baselines. (1) Embedding models: DeepWalk [Perozzi et al.2014] and Planetoid [Yang et al.2016]. Both employ a direct embedding lookup and adopt random walks to sample node pairs. However, DeepWalk is unsupervised such that node classification is performed as a downstream task on the learned representations, whereas Planetoid is an end-to-end model that jointly learns the representations and the classifier. (2) GNN models: GCN [Kipf and Welling2017], GAT [Veličković et al.2018] and GIN [Xu et al.2019]. (3) GNN-FiLM: The original GNN-FiLM works with a GCN-style model, which we call GCN-FiLM. We further extend it to two other GNN architectures, GAT and GIN, resulting in two more versions, GAT-FiLM and GIN-FiLM.

On the other hand, our approach LGNN can also be implemented on different GNN architectures. Specifically, we employ GCN, GAT and GIN as the global model, and obtain localized versions LGCN, LGAT and LGIN, respectively.

Main settings.

For DeepWalk, we sample 10 random walks per node with walk length 100 and window size 5, and set the embedding dimension to 128. We further train a logistic regression as the downstream classifier. For Planetoid, we set the path length to 10, window size to 3 and the embedding dimension to 50. We report the best results from its transductive and inductive versions. For all GNN-based approaches, we adopt two aggregation layers, noting that deeper layers often bring in noise from distant nodes [Pei et al.2020] and cause “over-smoothing” such that all nodes obtain very similar representations [Li et al.2018, Xu et al.2018]. The dimension of the hidden layer defaults to 8, while we also present results using larger dimensions. The regularization of GNN parameters is set to . These settings are chosen via empirical validation, and are largely consistent with the literature [Perozzi et al.2014, Kipf and Welling2017, Veličković et al.2018]. For our models, the additional regularizations are set to (except for in LGAT).
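For readers unfamiliar with the DeepWalk setting, the walk-sampling step with the stated configuration (10 walks per node, walk length 100) can be sketched as below. This is an illustrative sketch only: it assumes uniform neighbor sampling, counts walk length in nodes, and uses a hypothetical adjacency-list input and function name `sample_walks`. The window size applies to the subsequent skip-gram training, which is not shown.

```python
import random


def sample_walks(adj, walks_per_node=10, walk_length=100, seed=0):
    """Sample uniform random walks; adj maps each node to a list of its neighbors."""
    rng = random.Random(seed)
    walks = []
    for start in adj:
        for _ in range(walks_per_node):
            walk = [start]
            # Stop early only if the current node has no neighbors to move to.
            while len(walk) < walk_length and adj[walk[-1]]:
                walk.append(rng.choice(adj[walk[-1]]))
            walks.append(walk)
    return walks


toy_adj = {0: [1], 1: [0, 2], 2: [1]}  # hypothetical path graph 0 - 1 - 2
walks = sample_walks(toy_adj)
print(len(walks))     # 30
print(len(walks[0]))  # 100
```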

Training and testing.

For all datasets, we follow the standard split in the literature [Yang et al.2016, Kipf and Welling2017, Veličković et al.2018], which uses 20 labeled nodes per class for training, 500 nodes for validation and 1000 nodes for testing. We repeat each experiment for 10 runs, and report the average accuracy and micro-F score.
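The sizes of such a split can be illustrated with a small sampling routine. Note the assumptions: the standard splits in the cited literature are fixed, whereas this sketch draws a random split of the same shape; the function name `make_split` and the synthetic labels are hypothetical.

```python
import random
from collections import defaultdict


def make_split(labels, per_class=20, num_val=500, num_test=1000, seed=0):
    """Draw `per_class` training nodes per class, then disjoint validation and test sets."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for node, y in enumerate(labels):
        by_class[y].append(node)
    train = []
    for y in sorted(by_class):
        train += rng.sample(by_class[y], per_class)  # 20 labeled nodes per class
    chosen = set(train)
    rest = [v for v in range(len(labels)) if v not in chosen]
    rng.shuffle(rest)
    return train, rest[:num_val], rest[num_val:num_val + num_test]


# Synthetic labels with the node and class counts of Cora (2,708 nodes, 7 classes).
toy_labels = [i % 7 for i in range(2708)]
train, val, test = make_split(toy_labels)
print(len(train), len(val), len(test))  # 140 500 1000
```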

Figure 2: Ablation study on the effect of each localization module.

3.2 Performance Comparison

Table 2 shows the comparison of the classification performance on the four datasets. For each GNN architecture, we also include additional models by increasing the dimension of its hidden layer. For instance, GCN-64 denotes that the hidden layer of GCN has 64 neurons. The results reveal a few important observations.

Firstly, we observe that LGNN consistently achieves performance boosts w.r.t. three state-of-the-art GNNs on four datasets. This suggests that our localization approach is agnostic of the particular GNN architecture of the global model, and can be used universally on different GNNs. Furthermore, LGNNs also outperform GNN-FiLMs, as GNN-FiLMs only achieve at best comparable performance with the base GNNs. A potential reason is that GNN-FiLM is only conditioned on the target node’s self-information to modulate message passing for different edge labels, which may not be useful on graphs without edge labels. In contrast, LGNN is conditioned on the full local context of the target node to achieve node-wise localization that could be useful to the classification of nodes residing across different localities on the graph.

Secondly, GAT-based models generally attain better performance than GCN- and GIN-based models, and the performance gain in LGAT w.r.t. GAT is smaller. This is not surprising as the attention coefficients on the edges can be understood as a limited form of localization to differentiate the weight of different neighbors.

Thirdly, the results imply that increasing the number of parameters alone cannot achieve the effect of localization. It may be argued that more parameters lead to higher model capacity, which can potentially assign some space to encode the local aspects. However, when the parameters are still shared globally, there is no explicit node-wise constraint and thus the model will still try to find a middle ground. Nevertheless, for a fair comparison, we use more parameters on each GNN baseline to match or exceed the number of parameters on the corresponding LGNN, as listed in Table 2 (details of the calculation are included in Section C of the supplemental material). As expected, more parameters only result in marginal improvements to the baselines. Thus, the efficacy of LGNN is derived from our localization strategy rather than just more parameters.

Fourthly, LGNNs are robust and stable, given their relatively small standard deviations in many cases. We hypothesize that, in conventional GNNs, a bad initialization would risk a suboptimal global model shared by all nodes without recourse. In contrast, node-wise localization offers an opportunity to adjust the suboptimal model for some nodes, and is thus less susceptible to the initialization.

It has also been discussed that in GNNs overfitting to the validation set is an issue [Shchur et al.2018], and thus models with more parameters tend to perform better given a larger validation set. Therefore, we further compare different methods using a smaller validation set of only 100 nodes (these results are in Section D of the supplemental material). Our proposed LGNN still outperforms all the baselines across the four datasets, demonstrating its power and stability.

3.3 Further Model Analysis

Due to space constraints, we only present an ablation study here; more analyses on the complexity and effect of regularization, as well as model visualization, can be found in Sections E, F and G of the supplemental material. In the ablation study, we investigate the effectiveness of the node- and edge-level localization modules in LGNN. To validate the contribution of each module, we compare the accuracy of four variants in Fig. 2: (1) global only without any localization; (2) node-level localization only; (3) edge-level localization only; and (4) the full model with both localization modules. We observe that utilizing only one module, whether at the node or edge level, consistently outperforms the global model. Between the two modules, the node-level localization tends to perform better. Nevertheless, modeling both jointly results in the best performance, which implies that both modules are effective and possess complementary localization power.

4 Related Work

Graph representation learning has received significant attention in recent years. Earlier network embedding approaches [Perozzi et al.2014, Tang et al.2015, Grover and Leskovec2016] employ a direct embedding lookup for node representations, which are often coupled via local structures such as skip-grams. To better capture global graph structures, GNNs [Kipf and Welling2017, Hamilton et al.2017, Veličković et al.2018, Xu et al.2019] open up promising opportunities. They generally follow a paradigm of recursive neighborhood aggregation, in which each node receives information from its neighbors on the graph. More complex structural information is also exploited by recent variants [Zhang et al.2020, Pei et al.2020, Chen et al.2020], but none of them deals with the node-wise localization.

While our approach can be deemed a form of hypernetworks [Ha et al.2017, Perez et al.2018] as discussed in Sect. 2.4, several different forms of local models have been explored in related problems. In manifold learning [Yu et al.2009, Ladicky and Torr2011], local codings are used to approximate any point on the manifold as a linear combination of its surrounding anchor points. In low-rank matrix approximation [Lee et al.2013], multiple low-rank approximations are constructed for different regions of the matrix before aggregation. These methods are tightly coupled with their problems, and cannot be easily extended to localize GNNs. Some meta-learning frameworks [Vinyals et al.2016, Snell et al.2017, Finn et al.2017] can also be viewed as adapting to local contexts. Given a set of training tasks, meta-learning aims to learn a prior that can be adapted to new unseen tasks, often for addressing few-shot learning problems. In particular, existing meta-learning models on graphs are mostly designed for few-shot node classification [Zhou et al.2019, Yao et al.2020, Liu et al.2021] or regression [Liu et al.2020]. Such few-shot settings are fundamentally different from our localization objective, in which all nodes are seen during training and we leverage the local contexts of the nodes in order to enhance their representations.

5 Conclusions

In this work, we identified the need to localize GNNs for different nodes that reside in non-uniform local contexts across the graph. This motivated us to propose a node-wise localization approach, named LGNN, in order to adapt to the locality of each node. On one hand, we encode graph-level general patterns using a global weight matrix. On the other hand, we modulate the global model to generate localized weights specific to each node, and further perform an edge-level modulation to enable finer-grained localization. Thus, the proposed LGNN can capture both local and global aspects of the graph well. Finally, our extensive experiments demonstrate that LGNN significantly outperforms state-of-the-art GNNs.


This research is supported by the Agency for Science, Technology and Research (A*STAR) under its AME Programmatic Funds (Grant No. A20H6b0151).


  • [Birnbaum et al.2019] Sawyer Birnbaum, Volodymyr Kuleshov, Zayd Enam, Pang Wei W Koh, and Stefano Ermon. Temporal FiLM: Capturing long-range sequence dependencies with feature-wise modulations. In NeurIPS, pages 10287–10298, 2019.
  • [Brockschmidt2020] Marc Brockschmidt. GNN-FiLM: Graph neural networks with feature-wise linear modulation. In ICML, pages 1144–1152, 2020.
  • [Cai et al.2018] Hongyun Cai, Vincent W Zheng, and Kevin Chang. A comprehensive survey of graph embedding: problems, techniques and applications. TKDE, 30(9):1616–1637, 2018.
  • [Chen et al.2020] Ming Chen, Zhewei Wei, Zengfeng Huang, Bolin Ding, and Yaliang Li. Simple and deep graph convolutional networks. In ICML, pages 1725–1735, 2020.
  • [Finn et al.2017] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, pages 1126–1135, 2017.
  • [Grover and Leskovec2016] Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. In KDD, pages 855–864, 2016.
  • [Ha et al.2017] David Ha, Andrew Dai, and Quoc V Le. Hypernetworks. In ICLR, 2017.
  • [Hamilton et al.2017] Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In NeurIPS, pages 1024–1034, 2017.
  • [Hou et al.2020] Yifan Hou, Jian Zhang, James Cheng, Kaili Ma, Richard TB Ma, Hongzhi Chen, and M Yang. Measuring and improving the use of graph information in graph neural networks. In ICLR, 2020.
  • [Kipf and Welling2017] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In ICLR, 2017.
  • [Ladicky and Torr2011] Lubor Ladicky and Philip HS Torr. Locally linear support vector machines. In ICML, pages 985–992, 2011.
  • [Lee et al.2013] Joonseok Lee, Seungyeon Kim, Guy Lebanon, and Yoram Singer. Local low-rank matrix approximation. In ICML, pages 82–90, 2013.
  • [Li et al.2018] Qimai Li, Zhichao Han, and Xiao-Ming Wu. Deeper insights into graph convolutional networks for semi-supervised learning. In AAAI, pages 3538–3545, 2018.
  • [Liu et al.2020] Zemin Liu, Wentao Zhang, Yuan Fang, Xinming Zhang, and Steven C.H. Hoi. Towards locality-aware meta-learning of tail node embeddings on networks. In CIKM, pages 975–984, 2020.
  • [Liu et al.2021] Zemin Liu, Yuan Fang, Chenghao Liu, and Steven C.H. Hoi. Relative and absolute location embedding for few-shot node classification on graph. In AAAI, 2021.
  • [Pei et al.2020] Hongbin Pei, Bingzhe Wei, Kevin Chen-Chuan Chang, Yu Lei, and Bo Yang. Geom-GCN: Geometric graph convolutional networks. In ICLR, 2020.
  • [Perez et al.2018] Ethan Perez, Florian Strub, Harm De Vries, Vincent Dumoulin, and Aaron Courville. FiLM: Visual reasoning with a general conditioning layer. In AAAI, pages 3942–3951, 2018.
  • [Perozzi et al.2014] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. DeepWalk: Online learning of social representations. In KDD, pages 701–710, 2014.
  • [Rozemberczki et al.2021] Benedek Rozemberczki, Carl Allen, and Rik Sarkar. Multi-Scale Attributed Node Embedding. Journal of Complex Networks, 9(2), 2021.
  • [Shchur et al.2018] Oleksandr Shchur, Maximilian Mumme, Aleksandar Bojchevski, and Stephan Günnemann. Pitfalls of graph neural network evaluation. arXiv preprint arXiv:1811.05868, 2018.
  • [Snell et al.2017] Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. In NeurIPS, pages 4077–4087, 2017.
  • [Tang et al.2015] Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. LINE: Large-scale information network embedding. In TheWebConf, pages 1067–1077, 2015.
  • [Veličković et al.2018] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph attention networks. In ICLR, 2018.
  • [Vinyals et al.2016] Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. In NeurIPS, pages 3630–3638, 2016.
  • [Wu et al.2020] Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and S Yu Philip. A comprehensive survey on graph neural networks. TNNLS, 32(1):4–24, 2020.
  • [Xu et al.2018] Keyulu Xu, Chengtao Li, Yonglong Tian, Tomohiro Sonobe, Ken-ichi Kawarabayashi, and Stefanie Jegelka. Representation learning on graphs with jumping knowledge networks. In ICML, pages 5453–5462, 2018.
  • [Xu et al.2019] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? In ICLR, 2019.
  • [Yang et al.2016] Zhilin Yang, William W Cohen, and Ruslan Salakhutdinov. Revisiting semi-supervised learning with graph embeddings. In ICML, pages 40–48, 2016.
  • [Yao et al.2020] Huaxiu Yao, Chuxu Zhang, Ying Wei, Meng Jiang, Suhang Wang, Junzhou Huang, Nitesh V Chawla, and Zhenhui Li. Graph few-shot learning via knowledge transfer. In AAAI, pages 6656–6663, 2020.
  • [Yu et al.2009] Kai Yu, Tong Zhang, and Yihong Gong. Nonlinear learning using local coordinate coding. In NeurIPS, pages 2223–2231, 2009.
  • [Zhang et al.2020] Kai Zhang, Yaokang Zhu, Jun Wang, and Jie Zhang. Adaptive structural fingerprints for graph attention networks. In ICLR, 2020.
  • [Zhou et al.2019] Fan Zhou, Chengtai Cao, Kunpeng Zhang, Goce Trajcevski, Ting Zhong, and Ji Geng. Meta-GNN: On few-shot node classification in graph meta-learning. In CIKM, pages 137–144, 2019.