Attentive Graph Neural Networks for Few-Shot Learning

by   Hao Cheng, et al.
Nanyang Technological University

Graph Neural Networks (GNNs) have demonstrated superior performance in many challenging applications, including few-shot learning tasks. Despite their powerful capacity to learn and generalize from few samples, GNNs usually suffer from severe over-fitting and over-smoothing as the model becomes deep, which limits model scalability. In this work, we propose a novel Attentive GNN to tackle these challenges by incorporating a triple-attention mechanism: node self-attention, neighborhood attention, and layer memory attention. We explain why the proposed attentive modules can improve GNN for few-shot learning with theoretical analysis and illustrations. Extensive experiments show that the proposed Attentive GNN outperforms the state-of-the-art GNN-based methods for few-shot learning over the mini-ImageNet and tiered-ImageNet datasets, under both inductive and transductive settings.



1 Introduction

Deep neural networks, e.g., Convolutional Neural Networks (CNNs), have been widely applied and achieve superior results on computer vision tasks such as image classification and segmentation. The conventional approach is supervised learning over a large-scale labeled dataset for the task, thanks to the scalability of CNNs. However, for tasks with only a few samples available, training a generic and highly flexible deep model may result in over-fitting and thus fail to generalize. Such a challenge arises in few-shot learning fei2006one , in which a classifier is learned to predict the labels of the query samples using only a few labeled support samples of each class. The training set contains only data of classes different from those in the test set. Various methods have been recently proposed for few-shot learning vinyals2016matching ; snell2017prototypical ; sung2018learning ; garcia2017few , including the popular meta-learning framework vinyals2016matching based on episodic training. Meta-learning splits the training set into a large number of sub-tasks to simulate the testing task, which are used to train a meta-learner to generalize knowledge to unseen classes. Another approach for few-shot learning is metric learning vinyals2016matching ; snell2017prototypical ; sung2018learning , which aims to learn a generalizable feature embedding to fully exploit the correlation between samples and classes.

Most of the proposed few-shot learning methods are based on CNNs, which are effective at modeling local and Euclidean-structured data properties. However, in few-shot learning tasks, CNNs have limitations in exploiting the intra- and inter-class relationships, which are typically non-local and non-Euclidean structured. Therefore, more recent works have focused on Graph Neural Networks (GNN) or Graph Convolutional Networks (GCN). The paper garcia2017few first applied the dense GNN to few-shot learning. Transductive Propagation Network (TPN) liu2018learning introduced a transductive mechanism to utilize the entire query set for transductive inference. Moreover, Edge-GNN kim2019edge considered propagation of edge label information to model the intra-class similarity and the inter-class dissimilarity. AdarGCN zhang2020adargcn proposed a new few-shot setting called FSFSL and an adaptive aggregation GCN to remove the noise from support images crawled from the web. However, several works li2018deeper ; rong2019dropedge reported over-fitting and over-smoothing issues when GNN models become deep (i.e., poor scalability), as applying GCN or GNN is a special form of Laplacian smoothing, which averages the neighbors of the target nodes. Very recent work rong2019dropedge attempted to alleviate these obstacles by randomly dropping graph edges in training, showing promising improvement for node classification. To the best of our knowledge, no work to date has addressed these issues in GNN-based methods for few-shot learning using a graph attention mechanism.

In this work, we propose a novel Attentive GNN for highly-scalable and effective few-shot learning. We incorporate a novel triple-attention mechanism, i.e., node self-attention, neighborhood attention, and layer memory attention, to tackle the over-fitting and over-smoothing challenges towards more effective few-shot image classification. Specifically, the node self-attention exploits inter-node and inter-class correlation beyond CNN-based features. Neighborhood attention modules impose sparsity on the adjacency matrices, to attend to the most related neighbor nodes. Layer memory attention applies dense connection to earlier-layer “memory” of node features. Furthermore, we explain how the proposed attentive modules help GNN generate discriminative features, and alleviate over-smoothing and over-fitting, with feature visualization and theoretical analysis. We conduct extensive experiments showing that the proposed Attentive GNN outperforms the state-of-the-art GNN-based methods for few-shot image classification over the mini-ImageNet and Tiered-ImageNet datasets, under both inductive and transductive settings.

2 Related Work

GNN for Few-shot Learning. GNN bruna2013spectral ; defferrard2016convolutional was first proposed for learning with graph-structured data and has proved to be a powerful technique for aggregating information from neighboring vertices. GNN was first used for few-shot learning in garcia2017few , which aims to learn a complete graph network of nodes with both feature and class information. Based on the episodic training mechanism, meta-graph parameters were trained to predict the label of a query node on the graph. Then TPN liu2018learning introduced the transductive setting into few-shot learning. Based on relation information between nodes, TPN constructs a top-k graph with a closed-form label propagation. In addition to utilizing node label information, EGNN kim2019edge considers edge information for a directed graph by defining both class and edge labels to fully explore the internal information of the graph.

Attention Mechanism. The attention mechanism vaswani2017attention aims to focus more on regions that are more relevant to the task and less on unrelated regions, by learning a mask matrix or a weighting matrix. In particular, self-attention cheng2016long ; parikh2016decomposable considers the inherent correlation (attention) within the input features themselves, and is mostly applied in graph models. For node classification, Graph Attention Network (GAT) velivckovic2017graph uses a graph attention layer to learn a weighted parameter vector based on entire neighborhoods to update the node representation. By utilizing graph convolution, SAGPool lee2019self selects the top-k nodes to generate a mask matrix for graph pooling. Moreover, attention mechanisms are also utilized for few-shot learning hou2019cross ; ke2020compare ; zhumulti . Cross Attention Module (CAM) hou2019cross generates cross attention maps for each pair of nodes to highlight the object regions for classification. Considering the attention between a query sample and each support class, Category Traversal Module (CTM) li2019finding finds task-relevant features based on both intra-class commonality and inter-class uniqueness in the feature space. Inspired by the non-local block, Binary Attention Network ke2020compare considers a non-local attention module to learn the similarity globally. For few-shot fine-grained image recognition, MattML zhumulti uses attention mechanisms of the base learner and task learner to capture discriminative parts of images.

3 Attentive Graph Neural Networks

3.1 Preliminaries

GNNs sperduti1997supervised ; bruna2013spectral ; defferrard2016convolutional are neural networks based on a graph structure $G = (V, E)$ with nodes $V$ and edges $E$. Different from the classic CNNs that mainly exploit the local features (e.g., image patch textures, sparsity) for representation, GNN regards each sample (e.g., image) as a vertex on the graph, and focuses on mining the important neighborhood information of each node, which is critical to constructing discriminative and generalizable features for many tasks, e.g., node classification and few-shot learning. To be specific, considering a multi-stage GNN model, the output of the $k$-th GNN layer can be represented as

$$X^{(k+1)} = \sigma\left( A^{(k)} X^{(k)} W^{(k)} \right), \tag{1}$$

where $X^{(0)}$ denotes the input feature and $x_i^{(k)}$ (the $i$-th row of $X^{(k)} \in \mathbb{R}^{N \times d_k}$) denotes the feature of node $i$ in the $k$-th layer, with $N$ and $d_k$ being the number of nodes and feature dimension at the $k$-th layer. Besides, $A^{(k)} \in \mathbb{R}^{N \times N}$ is called the weighted adjacency matrix, $W^{(k)} \in \mathbb{R}^{d_k \times d_{k+1}}$ is the trainable linear transformation, and $\sigma(\cdot)$ denotes a non-linear function, e.g., ReLU or Leaky-ReLU.
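As a minimal sketch of one propagation step, assuming the common layer form sigma(A X W) with ReLU as the non-linearity, the toy example below mixes node features through a hand-written adjacency matrix. The helper names (`matmul`, `relu`, `gnn_layer`) and all numbers are illustrative assumptions, not taken from the paper.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def relu(m):
    """Element-wise ReLU on a matrix."""
    return [[max(0.0, v) for v in row] for row in m]


def gnn_layer(adj, feats, weight):
    """One GNN propagation step: ReLU(A X W). Illustrative helper only."""
    return relu(matmul(matmul(adj, feats), weight))


# Toy graph: nodes 0 and 1 average each other, node 2 only sees itself.
adj = [[0.5, 0.5, 0.0],
       [0.5, 0.5, 0.0],
       [0.0, 0.0, 1.0]]
feats = [[1.0, 0.0],
         [0.0, 1.0],
         [2.0, -2.0]]
weight = [[1.0, 0.0],   # identity transform keeps the example easy to follow
          [0.0, 1.0]]

out = gnn_layer(adj, feats, weight)
print(out)
```

Nodes 0 and 1 end up with identical features after a single step, which hints at why stacking many such layers tends to blur node representations.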

There are different ways to construct the adjacency matrix $A$, e.g., $A_{ij} \in \{0, 1\}$ indicates whether nodes $i$ and $j$ are directly connected in the classic GCN bruna2013spectral . Besides, $A_{ij}$ can be the similarity or distance between nodes $i$ and $j$ vinyals2016matching , i.e., $A_{ij} = \phi_\theta(x_i, x_j)$, where $x_i$ denotes the node feature embedding, and the parameter $\theta$ can be fixed or learned. One popular example is to apply cosine correlation as the similarity metric, while a more flexible method is to learn a multi-layer perceptron (MLP) as the metric, i.e., $\phi_\theta(x_i, x_j) = \mathrm{MLP}_\theta(\mathrm{abs}(x_i - x_j))$, where $\mathrm{abs}(\cdot)$ denotes the absolute function. More recent works applied a Gaussian similarity function to construct the adjacency, e.g., TPN liu2018learning proposed the similarity function as $A_{ij} = \exp\left(-\frac{1}{2}\, d\left(\frac{x_i}{\sigma_i}, \frac{x_j}{\sigma_j}\right)\right)$, with $d(\cdot, \cdot)$ a distance and $\sigma_i$ being an example-wise length-scale parameter learned by a relation network of nodes used for normalization.
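The two fixed similarity choices mentioned above can be sketched as follows. This is an illustration under stated assumptions: the Gaussian variant uses the squared Euclidean distance between length-scaled embeddings, and the length scales are plain arguments here rather than outputs of a learned relation network. All function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine correlation between two embeddings."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def gaussian_similarity(u, v, sigma_u=1.0, sigma_v=1.0):
    """exp(-0.5 * squared distance) between length-scaled embeddings (assumed form)."""
    sq_dist = sum((a / sigma_u - b / sigma_v) ** 2 for a, b in zip(u, v))
    return math.exp(-0.5 * sq_dist)


def adjacency(feats, similarity):
    """Dense adjacency matrix with entry (i, j) = similarity(x_i, x_j)."""
    return [[similarity(u, v) for v in feats] for u in feats]


feats = [[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]]
cos_adj = adjacency(feats, cosine_similarity)
gauss_adj = adjacency(feats, gaussian_similarity)
print(cos_adj[0])    # node 0 is aligned with node 1 and orthogonal to node 2
print(gauss_adj[0])  # similarity decays with distance from node 0
```

Cosine similarity ignores feature magnitude, whereas the Gaussian form is sensitive to it, which is one reason a learned length scale can matter.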

Different from the classic GNNs, the recently proposed GAT velivckovic2017graph exploits an attention mechanism amongst all neighbor nodes in the feature domain after the linear transformation $W$, and computes the weighted parameters based on attention coefficients $\alpha_{ij}$ for the graph update as

$$x_i^{(k+1)} = \sigma\Big( \sum_{j \in \mathcal{N}_i} \alpha_{ij}\, W x_j^{(k)} \Big),$$

where $\mathcal{N}_i$ denotes the set of the neighbor (i.e., connected) nodes of node $i$. GAT considers self-attention on the nodes after the linear transformation $W$. With a shared attention mechanism, GAT allows all neighbor nodes to attend to the target node. Moreover, GAT can generate a more powerful feature representation with an extension to multi-head attention. However, GAT only considers the relationship among neighbors in the same layer and fails to utilize the layer-wise information, which may lead to over-smoothing. Furthermore, GAT applies self-attention based only on node features, ignoring label information.
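A rough sketch of GAT-style attention over neighbors is given below. It follows the widely used formulation in which a score is computed from the concatenated pair of (already linearly transformed) features, passed through a LeakyReLU, and normalized with a softmax over each node's neighborhood. The attention vector `a`, the toy features, and the omission of the final non-linearity are assumptions made for brevity.

```python
import math


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def attention_coefficients(h, neighbors, a):
    """alpha[i][j]: softmax over j in neighbors[i] of LeakyReLU(a . [h_i || h_j])."""
    alpha = {}
    for i, nbrs in neighbors.items():
        scores = [leaky_relu(sum(w * v for w, v in zip(a, h[i] + h[j]))) for j in nbrs]
        top = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        alpha[i] = {j: e / total for j, e in zip(nbrs, exps)}
    return alpha


def aggregate(h, alpha):
    """Attention-weighted sum of neighbor features (non-linearity omitted)."""
    out = []
    for i in range(len(h)):
        row = [0.0] * len(h[i])
        for j, w in alpha[i].items():
            for d, v in enumerate(h[j]):
                row[d] += w * v
        out.append(row)
    return out


h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]        # transformed features W x_j
neighbors = {0: [0, 1], 1: [0, 1, 2], 2: [1, 2]}  # includes self-connections
a = [0.5, -0.5, 1.0, 0.0]                        # hypothetical attention vector

alpha = attention_coefficients(h, neighbors, a)
updated = aggregate(h, alpha)
print(alpha[1])
print(updated[1])
```

Because the coefficients depend only on features within one layer, nothing in this update looks back at earlier layers or at label information.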

3.2 What We Propose: Attentive GNN

We propose an Attentive GNN model which contains three attentive mechanisms, i.e., node self-attention, neighborhood attention, and layer memory attention. Fig. 1 shows the end-to-end pipeline of Attentive GNN for the few-shot learning problem. We discuss each of the attention mechanisms, followed by how Attentive GNN is applied for few-shot learning.

3.2.1 Node Self-Attention

Denote the feature of each sample (i.e., node) as $x_i \in \mathbb{R}^{d}$, and the one-hot vector of its corresponding label as $y_i \in \mathbb{R}^{C}$, where $d$ is the feature dimension, $C$ is the total number of classes, and $i = 1, \dots, N$. The one-hot vector sets only the element corresponding to the ground-truth category to 1, while the others are all set to 0. We propose the node self-attention to exploit the inter-class and inter-sample correlation at the initial stage. Denote the sample matrices and label matrices as

$$X = [x_1, x_2, \dots, x_N]^\top \in \mathbb{R}^{N \times d}, \qquad Y = [y_1, y_2, \dots, y_N]^\top \in \mathbb{R}^{N \times C}.$$

The first step is to calculate the sample and label correlation matrices as


Here, is the normalization matrix defined as , denotes a point-wise product operator, and denotes a row-wise softmax operator for the sample and label correlation matrices. Take the sample correlation as an example, and let . The row-wise softmax operator is defined as


where denotes the set of nodes that are connected to the -th node.
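The row-wise softmax restricted to connected nodes can be illustrated with a short sketch. The function name and the toy score matrix are assumptions; entries for unconnected nodes are simply set to zero.

```python
import math


def masked_row_softmax(scores, connected):
    """Softmax each row i over the columns in connected[i]; other entries become 0."""
    out = []
    for i, row in enumerate(scores):
        cols = connected[i]
        top = max(row[j] for j in cols)  # stabilizes the exponentials
        exps = {j: math.exp(row[j] - top) for j in cols}
        total = sum(exps.values())
        out.append([exps[j] / total if j in exps else 0.0 for j in range(len(row))])
    return out


scores = [[1.0, 1.0, 5.0],
          [0.0, 2.0, 0.0]]
connected = [[0, 1],       # node 0 is not connected to node 2
             [0, 1, 2]]

attention = masked_row_softmax(scores, connected)
print(attention)
```

The large score in position (0, 2) has no effect because node 2 is outside the neighborhood of node 0, so the first row splits its weight evenly between the two connected nodes.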

The proposed node self-attention module exploits the correlation amongst both sample features and label vectors, which should share the information from different perspectives for the same node. Thus, the next step is to fuse and using trainable kernels as


where denotes the attention map concatenation, and is the kernel coefficients. With the fused self-attention map, both the feature and the label vectors are updated on the nodes:


where is a weighting parameter. Different from the feature update, the label update preserves the initial labels, which are the ground truth, in the support set, using the weighting parameter to regularize the label update. The updated sample features and labels are concatenated to form the node features .
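To make the role of the weighting parameter concrete, the sketch below uses one plausible form of the label update, a convex combination of the initial labels and the attention-propagated labels. This specific form, the name `keep_weight`, and the uniform label assigned to the query node are assumptions for illustration; the paper's exact update may differ.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def update_labels(attention, labels, keep_weight):
    """Assumed form: Y' = keep_weight * Y + (1 - keep_weight) * (attention @ Y)."""
    propagated = matmul(attention, labels)
    return [[keep_weight * y + (1.0 - keep_weight) * p for y, p in zip(y_row, p_row)]
            for y_row, p_row in zip(labels, propagated)]


attention = [[0.5, 0.5],
             [0.5, 0.5]]
labels = [[1.0, 0.0],   # support node with a one-hot ground-truth label
          [0.5, 0.5]]   # query node, initialized with a uniform label (assumption)

updated = update_labels(attention, labels, keep_weight=0.5)
print(updated)
```

With `keep_weight` close to 1 the support labels stay near their ground truth, while smaller values let information flow between nodes more freely.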

Figure 1: Illustration of the proposed Attentive GNN framework for few-shot learning. The nodes connected by arrows represent the selected neighbors for feature update.

3.2.2 Graph Neighbor Attention via Sparsity

Similar to various successful GNN frameworks, the proposed Attentive GNN applies an MLP to learn the adjacency matrix $\tilde{A}$ for feature updates. When the GNN model becomes deeper, the risk of over-smoothing increases, as GNN tends to mix information from all neighbor nodes and eventually converges to a stationary point in training. To tackle this challenge, we propose a novel graph neighbor attention via a sparsity constraint to attend to the most related nodes:

$$\hat{A} = \arg\min_{A} \|A - \tilde{A}\|_F^2 \quad \text{s.t.} \quad \|A_i\|_0 \le \beta N, \ \forall i.$$

Here, $A_i$ denotes the $i$-th row of $A$, $\beta$ denotes the ratio of nodes maintained for feature update, and $N$ is the number of graph nodes. With the constraint, the adjacency matrix has up to $\beta N$ non-zeros in each row, corresponding to the attended neighbor nodes. The solution to (3.2.2) is achieved using the projection onto a unit ball, i.e., keeping the $\beta N$ elements of each $\tilde{A}_i$ with the largest magnitudes wen2015structured . Since the solution to (3.2.2) is non-differentiable, we apply alternating projection for training, i.e., in each epoch, the $\tilde{A}_i$'s are first updated using back-propagation, followed by applying (3.2.2) to update $\hat{A}$, which is constrained to be sparse.
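The projection step described above, keeping only the largest-magnitude entries in each row of the adjacency matrix, can be sketched as follows. The function name, the rounding of the kept count with `int`, and the toy matrix are assumptions.

```python
def sparsify_rows(adj, ratio):
    """Keep the int(ratio * N) largest-magnitude entries of each row; zero the rest."""
    n = len(adj)
    keep = int(ratio * n)
    out = []
    for row in adj:
        # Column indices sorted by decreasing magnitude.
        order = sorted(range(n), key=lambda j: abs(row[j]), reverse=True)
        kept = set(order[:keep])
        out.append([v if j in kept else 0.0 for j, v in enumerate(row)])
    return out


adj = [[0.9, 0.1, 0.5, 0.2],
       [0.3, 0.8, 0.1, 0.4],
       [0.2, 0.6, 0.7, 0.1],
       [0.4, 0.3, 0.2, 0.9]]

sparse_adj = sparsify_rows(adj, ratio=0.5)
for row in sparse_adj:
    print(row)
```

In an alternating scheme, a step like this would be applied after each gradient update of the dense adjacency, so the hard selection itself never needs a gradient.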

3.2.3 Layer Memory Attention

To avoid the over-smoothing and over-fitting issues due to “over-mixing” neighboring nodes information, another approach is to attend to the “earlier memory” of intermediate features at previous layers. Inspired by DenseNet huang2017densely , JKNet xu2018representation , GFCN ji2020gfcn and few-shot GNN garcia2017few , we densely connect the output of each GNN layer, as the intermediate GNN-node features maintain the consistent and more general representation across different GNN layers.

The proposed Attentive GNN applies the transition function based on (1). In addition, we utilize the graph self-loop, i.e., the identity matrix $I$, to utilize self information as

$$\tilde{X}^{(k+1)} = \sigma\left( \left[ A^{(k)} X^{(k)} \,\|\, I X^{(k)} \right] W^{(k)} \right),$$

where $\|$ means row-wise feature concatenation and $W^{(k)} \in \mathbb{R}^{2 d_k \times m}$. Furthermore, instead of using $\tilde{X}^{(k+1)}$ directly as the input node feature at the $(k+1)$-th layer, we propose to attend to the “early memory” by concatenating the node feature at the $k$-th layer as

$$X^{(k+1)} = \left[ X^{(k)} \,\|\, \tilde{X}^{(k+1)} \right].$$

With the proposed layer memory attention module, though the node feature dimension increases as the model becomes deep, there are only $m$ new features introduced in a new layer, while the other features are attended to the early memory.
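The bookkeeping behind this dense connection can be sketched in a few lines, assuming each layer appends a fixed number of new features to the features carried over from the previous layer. The function names and the dimensions used (64 input features, 16 new features per layer) are hypothetical.

```python
def concat_features(previous, new):
    """Row-wise concatenation: each node keeps its old features and appends new ones."""
    return [p + n for p, n in zip(previous, new)]


def feature_dims(input_dim, new_per_layer, num_layers):
    """Feature dimension at the input of each layer under dense concatenation."""
    dims = [input_dim]
    for _ in range(num_layers):
        dims.append(dims[-1] + new_per_layer)
    return dims


previous = [[1.0, 2.0], [3.0, 4.0]]   # features of two nodes at layer k
new = [[0.5], [0.7]]                  # newly computed features at layer k
combined = concat_features(previous, new)
dims = feature_dims(input_dim=64, new_per_layer=16, num_layers=3)
print(combined)
print(dims)
```

The feature dimension grows only linearly with depth, and every earlier representation stays directly visible to later layers.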

3.3 Application: Few-Shot Learning Task

Problem Definition  We apply the proposed Attentive GNN to few-shot image classification tasks: given a large-scale labeled training set $D_{train}$ with classes $\mathcal{C}_{train}$, and a few-shot testing set $D_{test}$ with classes $\mathcal{C}_{test}$ which are mutually exclusive, i.e., $\mathcal{C}_{train} \cap \mathcal{C}_{test} = \emptyset$, we aim to train a classification model over the training set, which could be applied to the test set with only a few labeled samples. Such a test is called the $C$-way $K$-shot task, where $K$ is the number of labeled samples per class, which is often set from 1 to 5, i.e., the testing set contains a support set $S$ which is labeled, and a query set $Q$ to be predicted, denoted as $D_{test} = S \cup Q$. Both $C$ and $K$ are very small for few-shot learning.

Attentive Model for Few-Shot Learning  Following the same strategy of episodic training vinyals2016matching and the meta-learning framework, we simulate $C$-way $K$-shot tasks which are randomly sampled from the training set, in which the support set includes $K$ labeled samples (e.g., images) from each of the $C$ classes and the query set includes unlabeled samples from the same $C$ classes. Each task is modeled as a complete graph garcia2017few , in which each node represents an image sample and its label. The objective is to learn the parameters of Attentive GNN using the simulated tasks, such that the model generalizes to an unseen few-shot task.
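Episodic task simulation can be sketched as below: pick a few classes at random, then draw disjoint support and query samples from each. The function name, argument names, and the toy dataset of string identifiers are assumptions for illustration.

```python
import random


def sample_episode(data_by_class, n_way, k_shot, n_query, rng):
    """Sample one few-shot task: (support, query) lists of (sample, episode_label)."""
    classes = rng.sample(sorted(data_by_class), n_way)
    support, query = [], []
    for label, cls in enumerate(classes):
        # Draw support and query samples together so they never overlap.
        items = rng.sample(data_by_class[cls], k_shot + n_query)
        support += [(x, label) for x in items[:k_shot]]
        query += [(x, label) for x in items[k_shot:]]
    return support, query


# Toy training set: 8 classes with 10 image identifiers each.
data = {f"class{c}": [f"img{c}_{i}" for i in range(10)] for c in range(8)}
rng = random.Random(0)

support, query = sample_episode(data, n_way=5, k_shot=1, n_query=3, rng=rng)
print(len(support), len(query))
```

Labels are re-indexed within each episode, so the model cannot memorize a fixed mapping from training classes to outputs.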

Loss Function  For each simulated few-shot task with its query set $Q$, the parameters of the backbone feature extractor, the self-attention block, and the GNNs are trained by minimizing the cross-entropy loss function of the $C$ classes in the task over all query samples as:

$$\mathcal{L} = - \sum_{i \in Q} \sum_{c=1}^{C} y_{i,c} \log \hat{y}_{i,c},$$

where $\hat{y}_i$ and $y_i$ denote the predicted and ground-truth labels of the query sample $i$, respectively. We evaluate the proposed Attentive GNN for few-shot tasks using both inductive and transductive settings, which correspond to a single query sample per task and multiple query samples per class, respectively.
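A minimal sketch of the cross-entropy objective over query samples follows. Averaging over the query set (rather than summing) and the function name are assumptions; with one-hot ground truth, only the predicted probability of the true class contributes.

```python
import math


def query_cross_entropy(pred_probs, true_labels):
    """Mean negative log-probability assigned to the true class of each query sample."""
    losses = [-math.log(probs[label]) for probs, label in zip(pred_probs, true_labels)]
    return sum(losses) / len(losses)


# 5-way task with two query samples.
pred_probs = [[0.2, 0.2, 0.2, 0.2, 0.2],       # uninformative prediction
              [0.05, 0.8, 0.05, 0.05, 0.05]]   # confident and correct
true_labels = [3, 1]

loss = query_cross_entropy(pred_probs, true_labels)
print(loss)
```

A uniform prediction over five classes costs log(5) per sample, which is a useful reference point when reading training curves for 5-way tasks.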

4 Why it works

Discriminative Sample Representation. It is critical to obtain an initial feature representation of the samples that is sufficiently discriminative (i.e., samples of different classes are separated) for the GNN models in few-shot tasks. However, most of the existing GNN models work with generic features from a CNN-based backbone, and fail to capture the task-specific structure. The proposed node self-attention module exploits the cross-sample correlation, and can thus effectively guide the feature representation for each few-shot task. Fig. 2 compares two examples of the graph features for 5-way 1-shot transductive learning using t-SNE visualization maaten2008visualizing , obtained with the vanilla GNN and the proposed Attentive GNN. The vanilla GNN generates node representations that are “over-smoothed” due to poor initial features from the CNN-based backbone. On the contrary, the node self-attention module can effectively generate discriminative features, which lead to more promising results using the Attentive GNN.

(a) Vanilla GNN
(b) Attentive GNN
Figure 2: t-SNE visualization maaten2008visualizing of the graph features under 5-way 1-shot transductive setting using (a) vanilla GNN, and (b) Attentive GNN. Samples of different classes are color-coded. Leftmost plots: the initial feature embedding; Rightmost plots: final output; Middle plot of (b): output by the node self-attention.

Alleviation of Over-Smoothing and Over-Fitting. Over-fitting arises when learning an over-parametric model from limited training data, and it is extremely severe as the objective of few-shot learning is to generalize the knowledge from the training set to few-shot tasks. On the other hand, the over-smoothing phenomenon refers to the case where the features of all (connected) nodes converge to similar values as the model depth increases. We provide theoretical analysis to show that the proposed triple-attention mechanism can alleviate both over-fitting and over-smoothing in GNN training. For each result, the proof sketch is presented, while the corresponding full proofs are included in the Appendix.
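Over-smoothing can be observed in a toy setting: repeatedly averaging node features through a dense, row-normalized adjacency matrix, with no trainable weights or non-linearity, drives all nodes toward the same value. The matrix, the features, and the helper names below are illustrative assumptions.

```python
def propagate(adj, feats):
    """One linear propagation step: A X."""
    dim = len(feats[0])
    return [[sum(a * f[d] for a, f in zip(row, feats)) for d in range(dim)] for row in adj]


def max_feature_gap(feats):
    """Largest per-dimension spread between any two nodes."""
    return max(max(col) - min(col) for col in zip(*feats))


adj = [[0.5, 0.25, 0.25],
       [0.25, 0.5, 0.25],
       [0.25, 0.25, 0.5]]
feats = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]

gaps = []
for _ in range(6):
    gaps.append(max_feature_gap(feats))
    feats = propagate(adj, feats)
print(gaps)
```

In this example the spread between nodes shrinks by a factor of four per layer, so after a handful of layers the nodes are nearly indistinguishable.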

Lemma 1.

The node self-attention module is equivalent to a GNN layer if as

Proof Sketch.

The feature and label vector updates using (7) are similar to multiplying with an adjacency matrix in (1), while such a matrix is obtained in a self-supervised way. ∎

Proposition 1.

Applying the node self-attention module to replace a GNN layer in Attentive GNN, reduces the trainable-parameter complexity from to , where denotes the depth of MLP for generating the adjacency metric.

Proof Sketch.

The trainable parameters in a GNN layer (1) are mainly the linear transformation and the MLP, which scale as and , respectively. On the contrary, the graph self-attention only involve the kernels that are trainable. ∎

Lemma 1 and Proposition 1 prove that the node self-attention module involves far fewer trainable parameters than a normal GNN layer. Applying node self-attention instead of another GNN layer therefore reduces the model complexity and lowers the risk of over-fitting.

Next we show that using graph neighbor attention can help alleviate over-smoothing when training GNNs. The analysis is based on the recent works on DropEdge rong2019dropedge and GNN information loss oono2020graph . They proved that a sufficiently deep GNN model will always suffer from “$\varepsilon$-smoothing” oono2020graph , where $\varepsilon$ is defined as the error bound of the maximum distance among node features. Another concept is the “information loss” oono2020graph of a graph model, i.e., the dimensionality reduction of the node feature space after a given number of GNN layers. We use these two concepts to quantify the over-smoothing issue in our analysis.

Theorem 1.

Denote the same multi-layer GNN model with and without neighbor attention as and , respectively. Besides, denote the number of GNN layers for them to encounter the -smoothing oono2020graph as and , respectively. With sufficiently small in the node self-attention module, either (i) , or (ii) , will hold.


The result shows that the GNN model with graph neighbor attention (i) increases the maximum number of layers to encounter over-smoothing, or if the number of layers remains, (ii) the over-smoothing phenomenon is alleviated. ∎

5 Experiments

Datasets. We conducted extensive experiments to evaluate the effectiveness of the proposed Attentive GNN model for few-shot learning, over two widely-used few-shot image classification benchmarks, i.e., mini-ImageNet vinyals2016matching and tiered-ImageNet ren2018meta . Mini-ImageNet contains around 60000 images of 100 different classes extracted from the ILSVRC-12 challenge krizhevsky2012imagenet . We used the proposed splits by Sachin2017 , i.e., 64, 16 and 20 classes for training, validation and testing, respectively. Tiered-ImageNet dataset is a more challenging data subset from the ILSVRC-12 challenge krizhevsky2012imagenet , which contains more classes that are organized in a hierarchical structure, i.e., 608 classes from 34 top categories. We follow the setups proposed by ren2018meta , and split 34 top categories into 20 (resp. 351 classes), 6 (resp. 97 classes), and 8 (resp. 160 classes), for training, validation, and testing, respectively. For both datasets, all images are resized to .
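The split sizes quoted above can be checked with simple arithmetic. The dictionary names are illustrative; the images-per-class figure is only approximate because the text states "around 60000" images.

```python
# Class counts per split, as stated in the text.
mini_classes = {"train": 64, "val": 16, "test": 20}
tiered_categories = {"train": 20, "val": 6, "test": 8}
tiered_classes = {"train": 351, "val": 97, "test": 160}

mini_total = sum(mini_classes.values())                  # 100 classes
tiered_category_total = sum(tiered_categories.values())  # 34 top categories
tiered_class_total = sum(tiered_classes.values())        # 608 classes

# Roughly 600 images per class in mini-ImageNet.
approx_images_per_class = 60000 // mini_total

print(mini_total, tiered_category_total, tiered_class_total, approx_images_per_class)
```

The totals agree with the 100 classes of mini-ImageNet and the 608 classes in 34 top categories of tiered-ImageNet.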

Implementation Details. We follow most of the DNN-based few-shot learning schemes snell2017prototypical ; vinyals2016matching ; garcia2017few ; liu2018learning , and apply the popular ConvNet-4 as the backbone feature extractor, with convolution kernels, numbers of channels as

at corresponding layers, a batch normalization layer, a max pooling layer, and a LeakyReLU activation layer. Besides, two dropout layers are added to the last two convolution blocks to alleviate over-fitting garcia2017few .

We conducted both 5-way 1-shot, and 5-way 5-shot experiments, under both inductive and transductive settings liu2018learning . We use only one query sample for the inductive, and query samples per class for the transductive experiments. Our models are all trained using Adam optimizer with an initial learning rate of and a weight decay of . The mini-batch sizes are set to 100 / 40 and 30 / 20, for 1-shot / 5-shot inductive and transductive settings, respectively. We cut the learning rate in half every 15K and 30K epochs, for experiments over mini-ImageNet and tiered-ImageNet, respectively.
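The step-wise halving of the learning rate can be written as a small helper. The initial learning rate used in the example is a placeholder, since the actual value is not recoverable from the text; the interval of 15K corresponds to the mini-ImageNet setting described above.

```python
def halved_learning_rate(initial_lr, step, halve_every):
    """Learning rate after halving once every `halve_every` training steps."""
    return initial_lr * 0.5 ** (step // halve_every)


initial_lr = 1e-3  # placeholder value, not taken from the paper
schedule = [halved_learning_rate(initial_lr, step, halve_every=15000)
            for step in (0, 14999, 15000, 30000, 45000)]
print(schedule)
```

For tiered-ImageNet the same helper would be called with `halve_every=30000`.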

mini-ImageNet tiered-ImageNet
Model Trans 5-way 1-shot 5-way 5-shot 5-way 1-shot 5-way 5-shot
Matching-Net  vinyals2016matching
Proto-Net snell2017prototypical
Reptile nichol2018first
Relation-Net sung2018learning
GNN garcia2017few
SAML hao2019collect
EGNN kim2019edge
Reptile nichol2018first
MAML finn2017model
MAML finn2017model
Relation-Net sung2018learning
GNN garcia2017few
EGNN kim2019edge
TPN liu2018learning
TPN liu2018learning
DN4 li2019revisiting
Ours (Normalized)
Ours (Fusion)
Table 1: Few-shot classification accuracy averaged over mini-ImageNet and tiered-ImageNet. The best and second best results under each setting and dataset are highlighted as Red and blue, respectively. “BN” denotes that the batch normalization where query statistical information is used instead of global batch normalization.
Figure 3: Ablation study: Classification accuracy v.s the number of layers over mini-ImageNet.
Figure 4: Variation of the : Inductive classification accuracy over mini-ImageNet.

Results. We compare the proposed Attentive GNN to the state-of-the-art GNN-based approaches, all of which use the same backbone, i.e., ConvNet-4. Table 1 lists the average accuracies of few-shot image classification. In general, results under the transductive setting improve over the inductive results, as the algorithm can further exploit the correlation amongst the support and multiple query samples. Note that the transductive setting of EGNN kim2019edge is different from the other methods, i.e., it uses a different number of query samples from Attentive GNN. For all datasets and settings, Attentive GNN outperforms all the competing methods, except for transductive EGNN. Even under an “unfair” transductive setting, the proposed Attentive GNN performs comparably to EGNN on average.

Ablation Study. We investigate the effectiveness of each proposed attention module by conducting an ablation study. Fig. 3 plots the image classification accuracy over the mini-ImageNet dataset, with different variations of the proposed Attentive GNN, obtained by removing the node self-attention (self att) and layer memory attention (memory att) modules. Besides, instead of applying layer memory attention, which attends to the concatenated early features, we try another variation that concatenates only the label vectors (label concat). It is clear that all variations generate degraded results and suffer from more severe over-smoothing, i.e., accuracy drops quickly as the number of GNN layers increases. We show that the label concatenation (i.e., the red curve) is a reasonable alternative to layer memory attention, as it requires less memory. Furthermore, we study the influence of the graph neighbor attention for few-shot learning by varying its sparsity hyper-parameter. Fig. 4 plots the inductive image classification accuracy of Attentive GNN as this hyper-parameter (i.e., the ratio of zero elements in the adjacency matrix) varies in the graph neighbor attention module. When no elements are set to zero, it is equivalent to removing the graph neighbor attention altogether. By choosing the optimal values for the 5-way 1-shot and 5-way 5-shot settings, respectively, the graph neighbor attention can further boost the classification results.

Hyper-Parameters. There are two hyper-parameters in the proposed Attentive GNN: the ratio for label fusion, and the sparsity ratio in the neighbor attention module. The table of inductive accuracy below shows how varying these two parameters affects inductive learning for image classification, averaged over tiered-ImageNet. Both ratios range between 0 and 1. Besides, we also test the model when the label fusion mechanism is totally removed, denoted as “-” in the table. The empirical results demonstrate the effectiveness of label fusion, with 0.5 being a reasonable ratio. Besides, for 5-way 1-shot learning, the best result is generated when the sparsity ratio is 1.0 (i.e., all nodes are kept), which is equivalent to removing the graph neighbor attention. This is because the total number of nodes is small for 5-way 1-shot learning, so imposing sparsity leads to an overly restrictive model.

Table: Inductive accuracy on the tiered-ImageNet dataset with different graph settings. Here “-” means not applying node self-attention.

Hyper-parameter setting                 Accuracy
Sparsity ratio   Label fusion ratio     5-way 1-shot   5-way 5-shot
1.0              -                      54.97          70.92
0.7              -                      57.18          70.58
1.0              0                      57.41          72.03
1.0              0.5                    57.68          71.03
0.7              0.5                    57.47          72.29

Table: Effect of the query sample distribution on the tiered-ImageNet dataset for the 5-way 1-shot task under the transductive setting. Here the total number of query samples for the two settings is fixed.

Model                         Random   Uniform
Vanilla GNN garcia2017few     59.77    65.11
GNN w/ Neighbor Att.          60.18    65.49
Attentive GNN                 61.39    67.23

Robustness in Transductive Learning. While the query samples are always uniformly distributed across classes in the conventional transductive learning setting liu2018learning , such an assumption may not hold in practice, e.g., the query set may contain samples with random labels. We study how robust the proposed Attentive GNN is in such a setting by comparing it to the baseline GNN method garcia2017few and GNN with only neighbor attention (i.e., GNN w/ Neighbor Att.). In training, we correspondingly simulate query sets containing samples with random labels for Attentive GNN and all competing methods. The table of query-sample distributions above reports the image classification accuracy with 5-way 1-shot transductive learning, averaged over tiered-ImageNet. With query-set samples of “random” labels, the proposed Attentive GNN still generates significantly better results compared to the vanilla GNN. The same table shows that the proposed graph neighbor attention module contributes to the robustness, as the sparse adjacency matrix can attend to the related nodes (i.e., nodes of the same class) in an adaptive way, preventing “over-mixing” with all nodes.
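The difference between the "uniform" and "random" query distributions can be sketched as below, keeping the total number of query samples fixed in both cases. The function name, the mode strings, and the total of 15 query samples are assumptions for illustration.

```python
import random
from collections import Counter


def query_labels(n_way, total, mode, rng):
    """Labels of the query set: equal counts per class, or drawn at random."""
    if mode == "uniform":
        return [c for c in range(n_way) for _ in range(total // n_way)]
    return [rng.randrange(n_way) for _ in range(total)]


rng = random.Random(0)
uniform_labels = query_labels(n_way=5, total=15, mode="uniform", rng=rng)
random_labels = query_labels(n_way=5, total=15, mode="random", rng=rng)

print(Counter(uniform_labels))
print(Counter(random_labels))
```

Under the random setting some classes may have many query samples and others none, so a model that implicitly relies on balanced query sets loses that cue.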

6 Conclusion

In this paper, we proposed a novel Attentive GNN model for few-shot learning. The proposed Attentive GNN makes full use of the relationships between image samples for knowledge modeling and generalization. By introducing a triple-attention mechanism, the Attentive GNN model can effectively alleviate over-smoothing and over-fitting issues when applying deep GNN models. Extensive experiments are conducted over the popular mini-ImageNet and tiered-ImageNet datasets, showing that our proposed Attentive GNN outperforms the state-of-the-art GNN-based few-shot learning methods. We plan to apply Attentive GNN to other challenging applications in future work.

7 Appendix

Here we present (1) the detailed proofs of the results for the proposed Attentive GNN, i.e., Lemma 1, Proposition 1, and Theorem 1; and (2) additional experimental results applying the proposed Attentive GNN scheme to study the number of graph layers.

7.1 Proofs of the results for Attentive GNN

We prove the main results regarding the proposed attentive GNN. First of all, we analyze the proposed node self-attention, whose feature and label vector updates are


where denotes the attention map, and (resp. and ) denote the input (resp. output) feature and label vectors, respectively.

We prove Lemma 1, which shows that the proposed node self-attention can alleviate over-fitting by reducing the model complexity compared to adding more GNN layers. The output of the $k$-th GNN layer can be represented as

$$X^{(k+1)} = \sigma\left( A^{(k)} X^{(k)} W^{(k)} \right).$$

Lemma 1.

The node self-attention module is equivalent to a GNN layer if as

Proof of Lemma 1.

With the condition for equivalence, the output of the -th GNN layer becomes


Thus, (16) is equivalent to putting the node self-attention to replace the -th GNN layer, with and . ∎

Next, we prove Proposition 1 which shows the model complexity decrease from a trainable GNN layer to the proposed node self-attention module.

Proposition 1.

Applying the node self-attention module to replace a GNN layer in Attentive GNN, reduces the trainable-parameter complexity from to , where denotes the depth of MLP for generating the adjacency metric.

Proof of Proposition 1.

For a GNN layer following (1), both and the MLP are trainable, corresponding to free parameters scale as and , respectively. On the contrary, based on Lemma 1, the proposed node self-attention is equivalent to a GNN layer, with the and the MLP fixed. The only trainable parameters are the kernels to fuse the and , with the complexity scales as . ∎

Next we show that using graph neighbor attention can help alleviate over-smoothing for training GNN. We first quantify the degree of over-smoothing using the definitions from rong2019dropedge and oono2020graph .

Definition 1 (Feature Subspace).

Denote the -dimensional subspace as the feature space, with .

Definition 2 (Projection Loss).

Denote the operator of projection onto the feature subspace as


Denote the projection loss as

Definition 3 (-smoothing).

The GNN layer that suffers from -smoothing if . Given a multi-layer GNN with each the feature output of each layer as , we define the -smoothing layer as the minimal value that encounters -smoothing, i.e.,

Definition 4 (Dimensionality Reduction).

The dimensionality reduction of the node feature space after $k$ layers of GNNs is denoted as


With these definitions from rong2019dropedge and oono2020graph , we can now prove Theorem 1 for the graph neighbor attention as

$$\hat{A} = \arg\min_{A} \|A - \tilde{A}\|_F^2 \quad \text{s.t.} \quad \|A_i\|_0 \le \beta N, \ \forall i.$$

Here, $A_i$ denotes the $i$-th row of $A$, $\beta$ denotes the ratio of nodes maintained for feature update, and $N$ is the number of graph nodes. Besides, $\tilde{A}$ is the original adjacency matrix before applying the graph neighbor attention.

Theorem 1.

Denote the same multi-layer GNN model with and without neighbor attention as and , respectively. Besides, denote the number of GNN layers for them to encounter the -smoothing oono2020graph as and , respectively. With sufficiently small in the node self-attention module, either (i) , or (ii) , will hold.

Proof of Theorem 1.

Given the original , the solution to (7.1) is achieved using the projection onto a unit ball, i.e., keeping the elements of each with the largest magnitudes wen2015structured , i.e.,


Here, the set indexes the top- elements of largest magnitude in , and denotes the complement set of . When , it is equivalent to remove the edge connecting the -th node and -th node. Thus, equals to the number of edges been dropped by the node self-attention, and as .

Therefore, when is sufficiently small, there are sufficient number of edges been dropped by the node self-attention. Based on the Theorem 1 in rong2019dropedge , we have either of the two to alleviate over-smoothing phenomenon:

  • The number of layers without -smoothing increases by node self-attention, i.e., .

  • The information loss (i.e., dimensionality reduction by feature embedding) decreases by node self-attention, i.e.,

7.2 Study on the effectiveness of different Attentive GNN Layers

Here we investigate the effect of the number of GNN layers in our proposed Attentive GNN. Fig. 5 plots the classification accuracy of the 5-way 5-shot task over tiered-ImageNet under the transductive setting with different numbers of Attentive GNN layers. When the number of Attentive GNN layers is set to 0, we only adopt the node self-attention module followed by a fully-connected layer for classification. The test accuracy first increases significantly as the number of GNN layers increases, which is due to the powerful ability of GNN to integrate neighbor information. However, as the number of GNN layers continues to increase, the test accuracy starts to drop, as our Attentive GNN model becomes more likely to suffer from over-smoothing. In conclusion, we decide to adopt a 3-layer Attentive GNN model.

Figure 5: Accuracy on tiered-ImageNet against different numbers of GNN layers under the transductive setting for the 5-way 5-shot task.