1 Introduction
Deep neural networks, e.g., Convolutional Neural Networks (CNNs), have been widely applied and achieve superior results on computer vision tasks such as image classification and segmentation. The conventional approach is supervised learning over a large-scale labeled dataset for the task, which benefits from the scalability of CNNs. However, for tasks with only a few samples available, training a generic and highly flexible deep model may result in overfitting and thus fail to generalize. This challenge arises in few-shot learning fei2006one , in which a classifier is learned to predict the labels of the query samples using only a few labeled support samples of each class. The training set contains only data of classes different from those in the test set. Various methods have been recently proposed for few-shot learning vinyals2016matching ; snell2017prototypical ; sung2018learning ; garcia2017few , including the popular meta-learning framework vinyals2016matching based on episodic training. Meta-learning splits the training set into a large number of sub-tasks that simulate the testing task, which are used to train a meta-learner to generalize knowledge to unseen classes. Another approach to few-shot learning is metric learning vinyals2016matching ; snell2017prototypical ; sung2018learning , which aims to learn a generalizable feature embedding to fully exploit the correlation between samples and classes.
Most of the proposed few-shot learning methods are based on CNNs, which are effective at modeling local and Euclidean-structured data properties. However, in few-shot learning tasks, CNNs have limitations in exploiting the intra- and inter-class relationships, which are typically non-local and non-Euclidean structured. Therefore, more recent works have focused on Graph Neural Networks (GNNs) or Graph Convolutional Networks (GCNs). The paper garcia2017few first applied a dense GNN to few-shot learning. Transductive Propagation Network (TPN) liu2018learning introduced a transductive mechanism to utilize the entire query set for transductive inference. Moreover, EGNN kim2019edge considered the propagation of edge label information to model the intra-class similarity and the inter-class dissimilarity. AdarGCN zhang2020adargcn proposed a new few-shot setting called FSFSL and an adaptive aggregation GCN to remove the noise from support images crawled from the web. However, several works li2018deeper ; rong2019dropedge reported overfitting and oversmoothing issues when GNN models become deep (i.e., poor scalability), as applying a GCN or GNN is a special form of Laplacian smoothing, which averages the neighbors of the target nodes. Very recent work rong2019dropedge attempted to alleviate these obstacles by randomly dropping graph edges in training, showing promising improvement for node classification. To the best of our knowledge, no work to date has addressed these issues in GNN-based methods for few-shot learning using a graph attention mechanism.
In this work, we propose a novel Attentive GNN for highly scalable and effective few-shot learning. We incorporate a novel triple-attention mechanism, i.e., node self-attention, neighborhood attention, and layer memory attention, to tackle the overfitting and oversmoothing challenges towards more effective few-shot image classification. Specifically, the node self-attention exploits inter-node and inter-class correlation beyond CNN-based features. Neighborhood attention modules impose sparsity on the adjacency matrices, to attend to the most related neighbor nodes. Layer memory attention applies dense connections to the earlier-layer “memory” of node features. Furthermore, we explain how the proposed attentive modules help the GNN generate discriminative features and alleviate oversmoothing and overfitting, with feature visualization and theoretical analysis. We conduct extensive experiments showing that the proposed Attentive GNN outperforms the state-of-the-art GNN-based methods for few-shot image classification on the miniImageNet and tieredImageNet datasets, under both inductive and transductive settings.
2 Related Work
GNN for Few-shot Learning. GNNs bruna2013spectral ; defferrard2016convolutional were first proposed for learning with graph-structured data and have proved to be a powerful technique for aggregating information from neighboring vertices. GNN was first used for few-shot learning in garcia2017few , which aims to learn a complete graph network of nodes with both feature and class information. Based on the episodic training mechanism, meta-graph parameters were trained to predict the label of a query node on the graph. Then TPN liu2018learning introduced the transductive setting into few-shot learning. Based on relation information between nodes, TPN constructs a top-k graph with a closed-form label propagation. In addition to utilizing node label information, EGNN kim2019edge considers edge information for the directed graph by defining both class and edge labels to fully explore the internal information of the graph.
Attention Mechanism. The attention mechanism vaswani2017attention aims to focus more on the regions that are most relevant to the task and less on unrelated regions, by learning a mask matrix or a weighting matrix. In particular, self-attention cheng2016long ; parikh2016decomposable considers the inherent correlation (attention) within the input features themselves, and is widely applied in graph models. For node classification, Graph Attention Network (GAT) velivckovic2017graph uses a graph attention layer to learn a weighted parameter vector based on entire neighborhoods to update the node representation. By utilizing graph convolution, SAGPool lee2019self selects the top-k nodes to generate a mask matrix for graph pooling. Moreover, the attention mechanism is also utilized for few-shot learning hou2019cross ; ke2020compare ; zhumulti . Cross Attention Module (CAM) hou2019cross generates cross attention maps for each pair of nodes to highlight the object regions for classification. Considering the attention between a query sample and each support class, Category Traversal Module (CTM) li2019finding finds task-relevant features based on both intra-class commonality and inter-class uniqueness in the feature space. Inspired by the non-local block, Binary Attention Network ke2020compare uses a non-local attention module to learn the similarity globally. For few-shot fine-grained image recognition, MattML zhumulti uses the attention mechanisms of the base learner and task learner to capture discriminative parts of images.
3 Attentive Graph Neural Networks
3.1 Preliminaries
GNNs sperduti1997supervised ; bruna2013spectral ; defferrard2016convolutional are neural networks based on a graph structure $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ with nodes $\mathcal{V}$ and edges $\mathcal{E}$. Different from the classic CNNs that mainly exploit local features (e.g., image patch textures, sparsity) for representation, a GNN regards each sample (e.g., image) as a vertex on the graph, and focuses on mining the important neighborhood information of each node, which is critical for constructing discriminative and generalizable features for many tasks, e.g., node classification and few-shot learning. To be specific, considering a multi-stage GNN model, the output of the $k$-th GNN layer can be represented as
$$\mathbf{X}^{(k+1)} = \rho\big(\mathbf{A}^{(k)} \mathbf{X}^{(k)} \mathbf{W}^{(k)}\big), \qquad (1)$$
where $\mathbf{X}^{(0)}$ denotes the input feature and $\mathbf{x}_i^{(k)}$, the $i$-th row of $\mathbf{X}^{(k)} \in \mathbb{R}^{N \times d_k}$, denotes the feature of node $i$ in the $k$-th layer, with $N$ and $d_k$ being the number of nodes and the feature dimension at the $k$-th layer. Besides, $\mathbf{A}^{(k)} \in \mathbb{R}^{N \times N}$ is called the weighted adjacency matrix, $\mathbf{W}^{(k)} \in \mathbb{R}^{d_k \times d_{k+1}}$ is the trainable linear transformation, and $\rho(\cdot)$ denotes a nonlinear function, e.g., ReLU or LeakyReLU.
There are different ways to construct the adjacency matrix $\mathbf{A}$, e.g., $A_{ij} \in \{0, 1\}$ indicates whether nodes $i$ and $j$ are directly connected in the classic GCN bruna2013spectral . Besides, $A_{ij}$ can be the similarity or distance between nodes $i$ and $j$ vinyals2016matching , i.e., $A_{ij} = f_{\theta}(\phi(\mathbf{x}_i), \phi(\mathbf{x}_j))$, where $\phi(\cdot)$ denotes the node feature embedding, and the parameter $\theta$ can be fixed or learned. One popular example is to apply cosine correlation as the similarity metric, while a more flexible method is to learn a multi-layer perceptron (MLP) as the metric, i.e., $A_{ij} = \mathrm{MLP}(|\phi(\mathbf{x}_i) - \phi(\mathbf{x}_j)|)$, where $|\cdot|$ denotes the absolute function. More recent works applied a Gaussian similarity function to construct the adjacency, e.g., TPN liu2018learning proposed the similarity function $A_{ij} = \exp\big(-\tfrac{1}{2} d\big(\phi(\mathbf{x}_i)/\sigma_i, \phi(\mathbf{x}_j)/\sigma_j\big)\big)$, with $d(\cdot, \cdot)$ a distance function and $\sigma_i$ being an example-wise length-scale parameter learned by a relation network of nodes used for normalization.
Different from the classic GNNs, the recently proposed GAT velivckovic2017graph exploits an attention mechanism amongst all neighbor nodes in the feature domain after the linear transformation, and computes the weighted parameter based on attention coefficients for graph update as
$$\mathbf{x}_i^{(k+1)} = \rho\Big(\sum_{j \in \mathcal{N}_i} \alpha_{ij} \mathbf{W} \mathbf{x}_j^{(k)}\Big), \qquad (2)$$
where $\mathcal{N}_i$ denotes the set of the neighbor (i.e., connected) nodes of node $i$, $\alpha_{ij}$ is the attention coefficient between nodes $i$ and $j$, and $\rho(\cdot)$ is a nonlinear function. GAT considers self-attention on the nodes after the linear transformation $\mathbf{W}$. With a shared attention mechanism, GAT allows all neighbor nodes to attend to the target node. Moreover, GAT can generate a more powerful feature representation with an extension to multi-head attention. However, GAT only considers the relationship among neighbors in the same layer and fails to utilize the layer-wise information, which may lead to oversmoothing. Furthermore, GAT applies self-attention based only on node features, while ignoring label information.
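To make the layer update concrete, the following is a minimal sketch of a single propagation step of the form "nonlinearity applied to adjacency times features times weights", using a cosine-similarity adjacency. It is only an illustration under stated assumptions: the helper names (`cosine_adjacency`, `gnn_layer`), the choice of ReLU, and the toy inputs are hypothetical and not taken from the paper's implementation.

```python
import math

def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def cosine_adjacency(features):
    """Dense adjacency whose (i, j) entry is the cosine similarity of nodes i and j."""
    norms = [math.sqrt(sum(v * v for v in row)) or 1.0 for row in features]
    return [[sum(a * b for a, b in zip(fi, fj)) / (ni * nj)
             for fj, nj in zip(features, norms)]
            for fi, ni in zip(features, norms)]

def gnn_layer(features, adjacency, weight):
    """One propagation step: ReLU(A X W)."""
    mixed = matmul(matmul(adjacency, features), weight)
    return [[max(0.0, v) for v in row] for row in mixed]

# Toy graph with three nodes and two-dimensional features (assumed values).
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weight = [[1.0, 0.0], [0.0, 1.0]]  # identity transform, for readability
adjacency = cosine_adjacency(features)
output = gnn_layer(features, adjacency, weight)
```

In a learned model the adjacency would typically come from a trainable metric (e.g., an MLP over feature differences) rather than a fixed cosine similarity.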
3.2 What We Propose: Attentive GNN
We propose an Attentive GNN model which contains three attentive mechanisms, i.e., node self-attention, neighborhood attention, and layer memory attention. Fig. 1 shows the end-to-end pipeline of Attentive GNN for the few-shot learning problem. We discuss each of the attention mechanisms, followed by how Attentive GNN is applied to few-shot learning.
3.2.1 Node Self-Attention
Denote the feature of each sample (i.e., node) as , and the onehot vector of its corresponding label as , where is the feature dimension, is the total number of classes, and . The onehot vector sets only the element corresponding to the groundtruth category to be 1, while the others are all set to 0. We propose the node selfattention to exploit the interclass and intersample correlation at the initial stage. Denote the sample matrices and label matrices as
(3) 
The first step is to calculate the sample and label correlation matrices as
(4) 
Here, is the normalization matrix defined as , denotes a pointwise product operator, and denotes a rowwise softmax operator for the sample and label correlation matrices. Take the sample correlation as an example, and let . The rowwise softmax operator is defined as
(5) 
where denotes the set of nodes that are connected to the th node.
The proposed node selfattention module exploits the correlation amongst both sample features and label vectors, which should share the information from different perspectives for the same node. Thus, the next step is to fuse and using trainable kernels as
(6) 
where denotes the attention map concatenation, and is the kernel coefficients. With the fused selfattention map, both the feature and the label vectors are updated on the nodes:
(7) 
where is a weighting parameter. Different from the feature update, the label update preserves the initial labels, which are the ground truth, in the support set, using the weighting parameter to regularize the label update. The updated sample features and labels are concatenated to form the node features .
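The steps above (correlation maps, fusion, and feature and label updates) can be sketched as follows. This is one possible reading rather than the authors' exact formulation: the cosine normalization, the two scalar fusion weights in `kernel`, the name `alpha` for the label weighting parameter, and the use of all-zero label vectors for query nodes are assumptions made for illustration.

```python
import math

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def row_softmax(matrix):
    """Numerically stable softmax applied to every row."""
    out = []
    for row in matrix:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def correlation(rows):
    """Row-wise softmax of the norm-normalized Gram matrix of `rows`."""
    norms = [math.sqrt(sum(v * v for v in r)) or 1.0 for r in rows]
    gram = [[sum(a * b for a, b in zip(ri, rj)) / (ni * nj)
             for rj, nj in zip(rows, norms)]
            for ri, ni in zip(rows, norms)]
    return row_softmax(gram)

def node_self_attention(features, labels, kernel=(0.5, 0.5), alpha=0.5):
    """Fuse sample and label correlations, then update features and labels."""
    c_x = correlation(features)
    c_y = correlation(labels)
    # Assumed fusion: a weighted sum of the two attention maps.
    fused = [[kernel[0] * a + kernel[1] * b for a, b in zip(ra, rb)]
             for ra, rb in zip(c_x, c_y)]
    new_features = matmul(fused, features)
    propagated = matmul(fused, labels)
    # The label update keeps a share `alpha` of the initial (ground-truth) labels.
    new_labels = [[alpha * y + (1.0 - alpha) * p for y, p in zip(ry, rp)]
                  for ry, rp in zip(labels, propagated)]
    # Concatenate updated features and labels to form the node features.
    return [f + l for f, l in zip(new_features, new_labels)]

# Two support nodes (classes 0 and 1) and one unlabeled query node in between.
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
labels = [[1.0, 0.0], [0.0, 0.0], [0.0, 1.0]]
nodes = node_self_attention(features, labels)
```

In this toy example the query node receives more label mass for class 0, whose support sample it resembles most, which is the kind of task-specific guidance the module is meant to provide.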
3.2.2 Graph Neighbor Attention via Sparsity
Similar to various successful GNN frameworks, the proposed Attentive GNN applies an MLP to learn the adjacency matrix for feature updates. When the GNN model becomes deeper, the risk of oversmoothing increases, as the GNN tends to mix information from all neighbor nodes and eventually converges to a stationary point in training. To tackle this challenge, we propose a novel graph neighbor attention via a sparsity constraint to attend to the most related nodes:
$$\hat{\mathbf{A}}_i = \arg\min_{\hat{\mathbf{A}}_i} \|\hat{\mathbf{A}}_i - \mathbf{A}_i\|_2 \quad \text{s.t.} \quad \|\hat{\mathbf{A}}_i\|_0 \le \beta N, \quad \forall i. \qquad (8)$$
Here, $\hat{\mathbf{A}}_i$ denotes the $i$-th row of $\hat{\mathbf{A}}$, $\beta$ denotes the ratio of nodes maintained for feature update, and $N$ is the number of graph nodes. With the constraint, the adjacency matrix $\hat{\mathbf{A}}$ has up to $\beta N$ nonzeros in each row, corresponding to the attended neighbor nodes. The solution to (8) is achieved by projection onto the $\ell_0$ ball, i.e., keeping the $\beta N$ elements of each $\mathbf{A}_i$ with the largest magnitudes wen2015structured . Since the solution to (8) is non-differentiable, we apply alternating projection for training, i.e., in each epoch the $\mathbf{A}_i$'s are first updated using backpropagation, followed by applying (8) to update $\hat{\mathbf{A}}$, which is constrained to be sparse.
3.2.3 Layer Memory Attention
To avoid the oversmoothing and overfitting issues due to “overmixing” neighboring node information, another approach is to attend to the “earlier memory” of intermediate features at previous layers. Inspired by DenseNet huang2017densely , JK-Net xu2018representation , GFCN ji2020gfcn and few-shot GNN garcia2017few , we densely connect the output of each GNN layer, as the intermediate GNN node features maintain a consistent and more general representation across different GNN layers.
The proposed Attentive GNN applies the transition function based on (1). In addition, we add a graph self-loop, i.e., the identity matrix, to utilize self information as
(9) 
where means rowwise feature concatenation and . Furthermore, instead of using directly as the input node feature at the +th layer, we propose to attend to the “early memory” by concatenating the node feature at the th layer as
(10) 
With the proposed layer memory attention module, although the node feature dimension grows as the model becomes deeper, each new layer introduces only its newly computed features, while the remaining features attend to the early memory.
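A minimal sketch of the dense "memory" connection described above: each layer mixes neighbor information and self information (the self-loop), applies a linear map and a nonlinearity, and appends the result to the features it received. The function name `memory_layer`, the ReLU nonlinearity, and the toy weights are illustrative assumptions, not the paper's exact layer.

```python
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def concat(a, b):
    """Row-wise feature concatenation of two node-feature matrices."""
    return [ra + rb for ra, rb in zip(a, b)]

def memory_layer(features, adjacency, weight):
    """Return [X || ReLU([A X || X] W)]: earlier features are kept, new ones appended."""
    neighbor = matmul(adjacency, features)  # information from neighbor nodes
    # Self-loop: multiplying by the identity leaves the features unchanged.
    mixed = matmul(concat(neighbor, features), weight)
    new = [[max(0.0, v) for v in row] for row in mixed]
    return concat(features, new)

# Three nodes, two input features, a uniform adjacency, one new feature per layer.
features = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
adjacency = [[1.0 / 3.0] * 3 for _ in range(3)]
w1 = [[0.1] for _ in range(4)]   # input width 2 * 2 = 4
h1 = memory_layer(features, adjacency, w1)
w2 = [[0.1] for _ in range(6)]   # input width 2 * 3 = 6
h2 = memory_layer(h1, adjacency, w2)
```

Because every layer only appends its own output, the leading columns of `h2` are still the untouched input features, which is what lets deeper layers attend to the early memory.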
3.3 Application: Few-Shot Learning Task
Problem Definition. We apply the proposed Attentive GNN to few-shot image classification tasks: given a large-scale labeled training set with classes $\mathcal{C}_{\text{train}}$, and a few-shot testing set with classes $\mathcal{C}_{\text{test}}$ which are mutually exclusive, i.e., $\mathcal{C}_{\text{train}} \cap \mathcal{C}_{\text{test}} = \emptyset$, we aim to train a classification model over the training set which can be applied to the test set with only a few labeled samples. Such a test is called the $C$-way $K$-shot task, where $C$ is the number of classes and $K$ is the number of labeled samples per class, which is often set from 1 to 5, i.e., the testing set contains a support set $\mathcal{S}$ which is labeled, and a query set $\mathcal{Q}$ to be predicted. Both $C$ and $K$ are very small for few-shot learning.
Attentive Model for Few-Shot Learning. Following the same strategy of episodic training vinyals2016matching and the meta-learning framework, we simulate $C$-way $K$-shot tasks which are randomly sampled from the training set, in which the support set includes $K$ labeled samples (e.g., images) from each of the $C$ classes and the query set includes unlabeled samples from the same $C$ classes. Each task is modeled as a complete graph garcia2017few , in which each node represents an image sample and its label. The objective is to learn the parameters of Attentive GNN using the simulated tasks, such that the model generalizes to an unseen few-shot task.
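The episodic simulation described above can be sketched as a small sampler. This is an illustrative sketch only: `sample_episode`, its parameter names, and the toy dataset of string identifiers are hypothetical and stand in for whatever data pipeline is actually used.

```python
import random

def sample_episode(samples_by_class, n_way, k_shot, n_query, rng):
    """Draw one few-shot episode as (support, query) lists of (sample, episode_label)."""
    classes = rng.sample(sorted(samples_by_class), n_way)
    support, query = [], []
    for episode_label, cls in enumerate(classes):
        # Draw disjoint support and query samples from the same class.
        picked = rng.sample(samples_by_class[cls], k_shot + n_query)
        support += [(s, episode_label) for s in picked[:k_shot]]
        query += [(s, episode_label) for s in picked[k_shot:]]
    return support, query

# Toy training set: 10 classes with 20 placeholder image identifiers each.
dataset = {f"class_{c}": [f"img_{c}_{i}" for i in range(20)] for c in range(10)}
support, query = sample_episode(dataset, n_way=5, k_shot=1, n_query=3,
                                rng=random.Random(0))
```

Each episode relabels its classes from 0 upward, so the model never sees the global class identities and has to rely on the support set, as it must at test time.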
Loss Function For each simulated fewshot task with its query set , the parameters of the backbone feature extractor, selfattention block , and GNNs
are trained by minimizing the crossentropy loss function of classes over all query samples as:
(11) 
where and denote the predicted and groundtruth labels of the query sample , respectively. We evaluate the proposed Attentive GNN for fewshot task using both inductive and transductive settings, which correspond to , and with , respectively.
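As a small illustration of the training objective, the sketch below computes a cross-entropy loss over the query samples of one episode from raw class scores. Averaging rather than summing over the query samples, and the function name `query_cross_entropy`, are assumptions made here for illustration.

```python
import math

def query_cross_entropy(logits, targets):
    """Mean cross-entropy over query samples; logits[i] holds one score per class."""
    total = 0.0
    for scores, target in zip(logits, targets):
        peak = max(scores)
        # Log of the softmax normalizer, computed stably.
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
        total += log_norm - scores[target]
    return total / len(targets)

# One uninformative and one confident prediction for a 5-way task.
uniform_loss = query_cross_entropy([[0.0] * 5], [2])
confident_loss = query_cross_entropy([[0.0, 0.0, 8.0, 0.0, 0.0]], [2])
```

With a single query sample per episode this corresponds to the inductive setting, while passing several query samples at once corresponds to the transductive setting.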
4 Why it works
Discriminative Sample Representation. It is critical for GNN models in few-shot tasks to obtain an initial feature representation of the samples that is sufficiently discriminative (i.e., samples of different classes are separated). However, most of the existing GNN models work with generic features from a CNN-based backbone, and fail to capture the task-specific structure. The proposed node self-attention module exploits the cross-sample correlation, and can thus effectively guide the feature representation for each few-shot task. Fig. 2 compares two examples of the graph features for 5-way 1-shot transductive learning using t-SNE visualization maaten2008visualizing , obtained with the vanilla GNN and the proposed Attentive GNN. The vanilla GNN generates node representations that are “oversmoothed” due to the poor initial features from the CNN-based backbone. In contrast, the node self-attention module can effectively generate discriminative features, which lead to the more promising results of the Attentive GNN.
Alleviation of Over-Smoothing and Over-Fitting. Overfitting arises when learning an over-parameterized model from limited training data, and it is especially severe here, as the objective of few-shot learning is to generalize the knowledge from the training set to few-shot tasks. On the other hand, the oversmoothing phenomenon refers to the case where the features of all (connected) nodes converge to similar values as the model depth increases. We provide theoretical analysis to show that the proposed triple-attention mechanism can alleviate both overfitting and oversmoothing in GNN training. For each result, a proof sketch is presented, while the corresponding full proofs are included in the Appendix.
Lemma 1.
The node selfattention module is equivalent to a GNN layer if as
(12) 
Proof Sketch.
Proposition 1.
Applying the node selfattention module to replace a GNN layer in Attentive GNN, reduces the trainableparameter complexity from to , where denotes the depth of MLP for generating the adjacency metric.
Proof Sketch.
The trainable parameters in a GNN layer (1) are mainly the linear transformation and the MLP, which scale as and , respectively. On the contrary, the graph selfattention only involve the kernels that are trainable. ∎
Lemma 1 and Proposition 1 show that the node self-attention module involves far fewer trainable parameters than a normal GNN layer. Thus, applying node self-attention instead of another GNN layer reduces the model complexity and lowers the risk of overfitting.
Next we show that using graph neighbor attention can help alleviate oversmoothing for training GNN. The analysis is based on the recent works on dropEdge rong2019dropedge and GNN information loss oono2020graph . They proved that a sufficiently deep GNN model will always suffer from “smoothing” oono2020graph , where is defined as the error bound of the maximum distance among node features. Another concept is the “information loss” oono2020graph of a graph model , i.e., the dimensionality reduction of the node featurespace after layers of GNNs, denoted as . We use these two concepts to quantify the oversmoothing issue in our analysis.
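The smoothing effect discussed above can be observed in a toy setting. The sketch below repeatedly applies a parameter-free averaging step with a fixed row-stochastic adjacency and tracks the maximum distance between node features; the adjacency values, the helper names, and the absence of trainable weights are simplifying assumptions for illustration.

```python
import math

def propagate(features, adjacency):
    """One smoothing step: each node becomes a weighted mean of all nodes."""
    dim = len(features[0])
    return [[sum(a * x[c] for a, x in zip(row, features)) for c in range(dim)]
            for row in adjacency]

def feature_spread(features):
    """Maximum Euclidean distance between any two node features."""
    return max(math.dist(a, b) for a in features for b in features)

# Fully connected 3-node graph with self-loops; every row sums to one.
adjacency = [[0.50, 0.25, 0.25],
             [0.25, 0.50, 0.25],
             [0.25, 0.25, 0.50]]
features = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

spreads = [feature_spread(features)]
for _ in range(6):
    features = propagate(features, adjacency)
    spreads.append(feature_spread(features))
```

The spread shrinks with every layer, so after a few steps the nodes are nearly indistinguishable; dropping weak edges changes how quickly this collapse happens, which is the effect the analysis below quantifies.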
Theorem 1.
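Denote the same multi-layer GNN model with and without neighbor attention as $\hat{\mathcal{M}}$ and $\mathcal{M}$, respectively. Besides, denote the number of GNN layers for them to encounter the $\epsilon$-smoothing oono2020graph as $\hat{l}$ and $l$, respectively. With a sufficiently small ratio of maintained nodes in the graph neighbor attention module, either (i) the number of layers before encountering smoothing increases, i.e., $\hat{l} \ge l$, or (ii) the information loss of $\hat{\mathcal{M}}$ is smaller than that of $\mathcal{M}$, will hold.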
Remarks.
The result shows that the GNN model with graph neighbor attention either (i) increases the maximum number of layers before oversmoothing is encountered, or (ii) alleviates the oversmoothing phenomenon if the number of layers remains the same. ∎
5 Experiments
Datasets. We conducted extensive experiments to evaluate the effectiveness of the proposed Attentive GNN model for few-shot learning, over two widely used few-shot image classification benchmarks, i.e., miniImageNet vinyals2016matching and tieredImageNet ren2018meta . miniImageNet contains around 60,000 images of 100 different classes extracted from the ILSVRC-12 challenge krizhevsky2012imagenet . We used the splits proposed by Sachin2017 , i.e., 64, 16 and 20 classes for training, validation and testing, respectively. The tieredImageNet dataset is a more challenging subset of the ILSVRC-12 challenge krizhevsky2012imagenet , which contains more classes that are organized in a hierarchical structure, i.e., 608 classes from 34 top categories. We follow the setup proposed by ren2018meta , and split the 34 top categories into 20 (resp. 351 classes), 6 (resp. 97 classes), and 8 (resp. 160 classes), for training, validation, and testing, respectively. For both datasets, all images are resized to .
Implementation Details. We follow most of the DNN-based few-shot learning schemes snell2017prototypical ; vinyals2016matching ; garcia2017few ; liu2018learning , and apply the popular ConvNet-4 as the backbone feature extractor, with convolution kernels, numbers of channels as
at corresponding layers, a batch normalization layer, a max pooling layer, and a LeakyReLU activation layer. Besides, two dropout layers are applied to the last two convolution blocks to alleviate overfitting garcia2017few .
We conducted both 5-way 1-shot and 5-way 5-shot experiments, under both inductive and transductive settings liu2018learning . We use only one query sample for the inductive, and query samples per class for the transductive experiments. Our models are all trained using the Adam optimizer with an initial learning rate of and a weight decay of . The mini-batch sizes are set to 100 / 40 and 30 / 20, for 1-shot / 5-shot inductive and transductive settings, respectively. We cut the learning rate in half every 15K and 30K epochs, for experiments over miniImageNet and tieredImageNet, respectively.
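The step-wise halving of the learning rate can be written as a one-line schedule. This is a minimal sketch: the function name `learning_rate` is hypothetical, and the initial value used in the example is a placeholder because the actual initial learning rate is not given here.

```python
def learning_rate(step, initial_lr, halve_every):
    """Halve the learning rate once every `halve_every` training steps."""
    return initial_lr * 0.5 ** (step // halve_every)

# Placeholder initial rate (an assumption); 15K is the miniImageNet interval.
schedule = [learning_rate(step, initial_lr=1e-3, halve_every=15_000)
            for step in (0, 14_999, 15_000, 30_000)]
```

For tieredImageNet the same function would be called with `halve_every=30_000`.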
Table 1: Average few-shot classification accuracies on miniImageNet and tieredImageNet. Columns: Model, Trans, and 5-way 1-shot and 5-way 5-shot accuracy on each dataset.
Rows (first group): MatchingNet vinyals2016matching ; ProtoNet snell2017prototypical ; Reptile nichol2018first ; RelationNet sung2018learning ; GNN garcia2017few ; SAML hao2019collect ; EGNN kim2019edge ; Ours.
Rows (second group): Reptile nichol2018first ; MAML finn2017model ; MAML finn2017model ; RelationNet sung2018learning ; GNN garcia2017few ; EGNN kim2019edge ; TPN liu2018learning ; TPN liu2018learning ; DN4 li2019revisiting ; Ours (Normalized) ; Ours (Fusion).
Results. We compare the proposed Attentive GNN to the state-of-the-art GNN-based approaches, all of which use the same backbone, i.e., ConvNet-4. Table 1 lists the average accuracies of the few-shot image classification. In general, results under the transductive setting improve over the inductive results, as the algorithm can further exploit the correlation amongst support and multiple query samples. Note that the transductive setting of EGNN kim2019edge differs from that of the other methods in the number of query samples used. For all datasets and settings, Attentive GNN outperforms all the competing methods, except for transductive EGNN. Even under this “unfair” transductive setting, the proposed Attentive GNN performs comparably to EGNN on average.
Ablation Study. We investigate the effectiveness of each proposed attention module by conducting an ablation study. Fig. 4 plots the image classification accuracy over the miniImageNet dataset for different variations of the proposed Attentive GNN, obtained by removing the node self-attention (self att) and layer memory attention (memory att) modules. Besides, instead of applying layer memory attention, which attends to the concatenated early features, we try another variation that concatenates only the label vectors (label concat). All variations generate degraded results and suffer from more severe oversmoothing, i.e., accuracy drops quickly as the number of GNN layers increases. The label concatenation (i.e., red curve) is a reasonable alternative to layer memory attention, as it requires less memory. Furthermore, we study the influence of the graph neighbor attention on few-shot learning by varying its sparsity hyperparameter. Fig. 4 also plots the inductive image classification accuracy of Attentive GNN with a varying ratio of zero elements in the adjacency matrix of the graph neighbor attention module. When no element is zeroed, the model is equivalent to removing the graph neighbor attention entirely. By choosing the optimal ratios for the 5-way 1-shot and 5-way 5-shot settings, respectively, the graph neighbor attention can further boost the classification results.
Hyper-Parameters. There are two hyperparameters in the proposed Attentive GNN, namely and , corresponding to the ratio for label fusion, and the sparsity ratio in the neighbor attention module. Table 5 shows how varying these two parameters affects the inductive learning for image classification averaged over tieredImageNet. Both and range between 0 and 1. Besides, we also test the model when the label fusion mechanism is totally removed, denoted as “” in the table. The empirical results demonstrate the effectiveness of label fusion with to be a reasonable ratio. Besides, for 5-way 1-shot learning, the best result is generated when , which is equivalent to removing the graph neighbor attention. This is because the total number of nodes is small for 5-way 1-shot learning, so imposing sparsity leads to an overly restrictive model.
Robustness in Transductive Learning. While the query samples are always uniformly distributed over the classes in the conventional transductive learning setting liu2018learning , such an assumption may not hold in practice, e.g., the query set may contain samples with random labels. We study how robust the proposed Attentive GNN is in such a setting by comparing it to the baseline GNN method garcia2017few and to a GNN with only neighbor attention (i.e., w/ Neighbor Att.). In training, we simulate the query set using samples with random labels for Attentive GNN and all competing methods. Table 5 shows the image classification accuracy with 5-way 1-shot transductive learning, averaged over tieredImageNet. With query-set samples of “random” labels, the proposed Attentive GNN still generates significantly better results than the vanilla GNN. Table 5 also shows that the proposed graph neighbor attention module contributes to the robustness, as the sparse adjacency matrix can attend to the related nodes (i.e., nodes of the same class) in an adaptive way, preventing “overmixing” with all nodes.
6 Conclusion
In this paper, we proposed a novel Attentive GNN model for few-shot learning. The proposed Attentive GNN makes full use of the relationships between image samples for knowledge modeling and generalization. By introducing a triple-attention mechanism, the Attentive GNN model can effectively alleviate oversmoothing and overfitting issues when applying deep GNN models. Extensive experiments are conducted over the popular miniImageNet and tieredImageNet datasets, showing that our proposed Attentive GNN outperforms the state-of-the-art GNN-based few-shot learning methods. We plan to apply Attentive GNN to other challenging applications in future work.
7 Appendix
Here we present (1) the detailed proofs of the results for the proposed Attentive GNN, i.e., Lemma 1, Proposition 1, and Theorem 1; and (2) additional experimental results on the number of graph layers used in the proposed Attentive GNN scheme.
7.1 Proofs of the results for Attentive GNN
We prove the main results regarding the proposed attentive GNN. First of all, we analyze the proposed node selfattention, whose feature and label vector updates are
(13) 
where denotes the attention map, and (resp. and ) denote the input (resp. output) feature and label vectors, respectively.
We prove Lemma 1, which shows that the proposed node self-attention can alleviate overfitting by reducing the model complexity compared to adding more GNN layers. The output of a GNN layer can be represented as
(14) 
Lemma 1.
The node selfattention module is equivalent to a GNN layer if as
(15) 
Proof of Lemma 1.
With the condition for equivalence, the output of the th GNN layer becomes
(16) 
Thus, (16) is equivalent to putting the node selfattention to replace the th GNN layer, with and . ∎
Next, we prove Proposition 1 which shows the model complexity decrease from a trainable GNN layer to the proposed node selfattention module.
Proposition 1.
Applying the node selfattention module to replace a GNN layer in Attentive GNN, reduces the trainableparameter complexity from to , where denotes the depth of MLP for generating the adjacency metric.
Proof of Proposition 1.
For a GNN layer following (1), both and the MLP are trainable, corresponding to free parameters scale as and , respectively. On the contrary, based on Lemma 1, the proposed node selfattention is equivalent to a GNN layer, with the and the MLP fixed. The only trainable parameters are the kernels to fuse the and , with the complexity scales as . ∎
Next we show that using graph neighbor attention can help alleviate oversmoothing for training GNN. We first quantify the degree of oversmoothing using the definitions from rong2019dropedge and oono2020graph .
Definition 1 (Feature Subspace).
Denote the dimensional subspace as the feature space, with .
Definition 2 (Projection Loss).
Denote the operator of projection onto a dimensional subspace as as
(17) 
Denote the projection loss as
(18) 
Definition 3 (smoothing).
The GNN layer that suffers from smoothing if . Given a multilayer GNN with each the feature output of each layer as , we define the smoothing layer as the minimal value that encounters smoothing, i.e.,
(19) 
Definition 4 (Dimensionality Reduction).
The dimensionality reduction of the node feature-space after layers of GNNs is denoted as
(20) 
With these definitions from rong2019dropedge and oono2020graph , we can now prove Theorem 1 for the graph neighbor attention as
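$$\hat{\mathbf{A}}_i = \arg\min_{\hat{\mathbf{A}}_i} \|\hat{\mathbf{A}}_i - \mathbf{A}_i\|_2 \quad \text{s.t.} \quad \|\hat{\mathbf{A}}_i\|_0 \le \beta N, \quad \forall i. \qquad (21)$$
Here, $\hat{\mathbf{A}}_i$ denotes the $i$-th row of $\hat{\mathbf{A}}$, $\beta$ denotes the ratio of nodes maintained for feature update, and $N$ is the number of graph nodes. Besides, $\mathbf{A}$ is the original adjacency matrix without the graph neighbor attention.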
Theorem 1.
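Denote the same multi-layer GNN model with and without neighbor attention as $\hat{\mathcal{M}}$ and $\mathcal{M}$, respectively. Besides, denote the number of GNN layers for them to encounter the $\epsilon$-smoothing oono2020graph as $\hat{l}$ and $l$, respectively. With a sufficiently small ratio of maintained nodes in the graph neighbor attention module, either (i) the number of layers before encountering smoothing increases, i.e., $\hat{l} \ge l$, or (ii) the information loss of $\hat{\mathcal{M}}$ is smaller than that of $\mathcal{M}$, will hold.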
Proof of Theorem 1.
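Given the original $\mathbf{A}$, the solution to (21) is achieved by projection onto the $\ell_0$ ball, i.e., keeping the $\beta N$ elements of each $\mathbf{A}_i$ with the largest magnitudes wen2015structured , i.e.,
$$\hat{A}_{i,j} = \begin{cases} A_{i,j}, & j \in \Omega_i, \\ 0, & j \in \Omega_i^{c}. \end{cases} \qquad (22)$$
Here, the set $\Omega_i$ indexes the top $\beta N$ elements of largest magnitude in $\mathbf{A}_i$, and $\Omega_i^{c}$ denotes the complement set of $\Omega_i$. When $\hat{A}_{i,j} = 0$, it is equivalent to removing the edge connecting the $i$-th node and the $j$-th node. Thus, $\sum_i |\Omega_i^{c}|$ equals the number of edges dropped by the graph neighbor attention, and it increases as $\beta$ decreases.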
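The row-wise projection described above (keep the largest-magnitude entries of each row, zero the rest) can be sketched as follows. The function name `sparsify_rows`, the rounding of the number of kept entries, and the rule of always keeping at least one entry per row are assumptions made for this illustration.

```python
def sparsify_rows(adjacency, ratio):
    """Keep the largest-magnitude entries in each row and zero out the others."""
    n = len(adjacency)
    keep = max(1, int(ratio * n))  # assumed rounding; at least one neighbor is kept
    sparse = []
    for row in adjacency:
        ranked = sorted(range(len(row)), key=lambda j: abs(row[j]), reverse=True)
        kept = set(ranked[:keep])
        sparse.append([v if j in kept else 0.0 for j, v in enumerate(row)])
    return sparse

adjacency = [[0.9, 0.1, 0.5, 0.2],
             [0.3, 0.8, 0.1, 0.6],
             [0.4, 0.2, 0.7, 0.1],
             [0.1, 0.5, 0.3, 0.9]]
sparse = sparsify_rows(adjacency, ratio=0.5)
dropped_edges = sum(1 for row in sparse for v in row if v == 0.0)
```

In training, this projection would be applied after each gradient update of the dense adjacency, following the alternating scheme described for the graph neighbor attention.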
Therefore, when the ratio of maintained nodes is sufficiently small, a sufficient number of edges are dropped by the graph neighbor attention. Based on Theorem 1 in rong2019dropedge , at least one of the following holds, which alleviates the oversmoothing phenomenon:
(i) The number of layers without smoothing increases with the graph neighbor attention.
(ii) The information loss (i.e., dimensionality reduction by feature embedding) decreases with the graph neighbor attention.
∎
7.2 Study on the effectiveness of different Attentive GNN Layers
Here we investigate the effect of the number of GNN layers in our proposed Attentive GNN. Fig. 5 plots the classification accuracy of the 5-way 5-shot task over tieredImageNet under the transductive setting with different numbers of Attentive GNN layers. When the number of Attentive GNN layers is set to 0, we only adopt the node self-attention module followed by a fully-connected layer for classification. The test accuracy increases significantly as the first GNN layers are added, which is due to the powerful ability of GNN to integrate neighbor information. However, as the number of GNN layers increases further, the test accuracy starts to drop, and our Attentive GNN model becomes more likely to suffer from oversmoothing. We therefore adopt a 3-layer Attentive GNN model.
References
 (1) Bruna, J., Zaremba, W., Szlam, A., LeCun, Y.: Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203 (2013)
 (2) Cheng, J., Dong, L., Lapata, M.: Long shortterm memorynetworks for machine reading. arXiv preprint arXiv:1601.06733 (2016)
 (3) Defferrard, M., Bresson, X., Vandergheynst, P.: Convolutional neural networks on graphs with fast localized spectral filtering. In: Advances in neural information processing systems. pp. 3844–3852 (2016)
 (4) Fei-Fei, L., Fergus, R., Perona, P.: One-shot learning of object categories. IEEE transactions on pattern analysis and machine intelligence 28(4), 594–611 (2006)
 (5) Finn, C., Abbeel, P., Levine, S.: Model-agnostic meta-learning for fast adaptation of deep networks. In: Proceedings of the 34th International Conference on Machine Learning-Volume 70. pp. 1126–1135. JMLR.org (2017)
 (6) Garcia, V., Bruna, J.: Few-shot learning with graph neural networks. arXiv preprint arXiv:1711.04043 (2017)
 (7) Hao, F., He, F., Cheng, J., Wang, L., Cao, J., Tao, D.: Collect and select: Semantic alignment metric learning for few-shot learning. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 8460–8469 (2019)
 (8) Hou, R., Chang, H., Bingpeng, M., Shan, S., Chen, X.: Cross attention network for few-shot classification. In: Advances in Neural Information Processing Systems. pp. 4005–4016 (2019)
 (9) Huang, G., Liu, Z., Van Der Maaten, L., Weinberger, K.Q.: Densely connected convolutional networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 4700–4708 (2017)
 (10) Ji, F., Yang, J., Zhang, Q., Tay, W.P.: Gfcn: A new graph convolutional network based on parallel flows. In: ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 3332–3336. IEEE (2020)
 (11) Ke, L., Pan, M., Wen, W., Li, D.: Compare learning: Bi-attention network for few-shot learning. In: ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 2233–2237. IEEE (2020)
 (12) Kim, J., Kim, T., Kim, S., Yoo, C.D.: Edge-labeling graph neural network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 11–20 (2019)
 (13) Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: Advances in neural information processing systems. pp. 1097–1105 (2012)
 (14) Lee, J., Lee, I., Kang, J.: Self-attention graph pooling. arXiv preprint arXiv:1904.08082 (2019)
 (15) Li, H., Eigen, D., Dodge, S., Zeiler, M., Wang, X.: Finding task-relevant features for few-shot learning by category traversal. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1–10 (2019)
 (16) Li, Q., Han, Z., Wu, X.M.: Deeper insights into graph convolutional networks for semi-supervised learning. In: Thirty-Second AAAI Conference on Artificial Intelligence (2018)
 (17) Li, W., Wang, L., Xu, J., Huo, J., Gao, Y., Luo, J.: Revisiting local descriptor based image-to-class measure for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 7260–7268 (2019)
 (18) Liu, Y., Lee, J., Park, M., Kim, S., Yang, E., Hwang, S.J., Yang, Y.: Learning to propagate labels: Transductive propagation network for few-shot learning. arXiv preprint arXiv:1805.10002 (2018)
 (19) Maaten, L.v.d., Hinton, G.: Visualizing data using t-SNE. Journal of machine learning research 9(Nov), 2579–2605 (2008)
 (20) Nichol, A., Achiam, J., Schulman, J.: On first-order meta-learning algorithms. arXiv preprint arXiv:1803.02999 (2018)
 (21) Oono, K., Suzuki, T.: Graph neural networks exponentially lose expressive power for node classification. In: International Conference on Learning Representations (2020)
 (22) Parikh, A.P., Täckström, O., Das, D., Uszkoreit, J.: A decomposable attention model for natural language inference. arXiv preprint arXiv:1606.01933 (2016)
 (23) Ravi, S., Larochelle, H.: Optimization as a model for few-shot learning. In: International Conference on Learning Representations (ICLR) (2017)
 (24) Ren, M., Triantafillou, E., Ravi, S., Snell, J., Swersky, K., Tenenbaum, J.B., Larochelle, H., Zemel, R.S.: Meta-learning for semi-supervised few-shot classification. arXiv preprint arXiv:1803.00676 (2018)
 (25) Rong, Y., Huang, W., Xu, T., Huang, J.: Dropedge: Towards deep graph convolutional networks on node classification. In: International Conference on Learning Representations (2019)
 (26) Snell, J., Swersky, K., Zemel, R.: Prototypical networks for few-shot learning. In: Advances in neural information processing systems. pp. 4077–4087 (2017)
 (27) Sperduti, A., Starita, A.: Supervised neural networks for the classification of structures. IEEE Transactions on Neural Networks 8(3), 714–735 (1997)
 (28) Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1199–1208 (2018)
 (29) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Advances in neural information processing systems. pp. 5998–6008 (2017)
 (30) Veličković, P., Cucurull, G., Casanova, A., Romero, A., Lio, P., Bengio, Y.: Graph attention networks. arXiv preprint arXiv:1710.10903 (2017)
 (31) Vinyals, O., Blundell, C., Lillicrap, T., Wierstra, D., et al.: Matching networks for one shot learning. In: Advances in neural information processing systems. pp. 3630–3638 (2016)
 (32) Wen, B., Ravishankar, S., Bresler, Y.: Structured overcomplete sparsifying transform learning with convergence guarantees and applications. International Journal of Computer Vision 114(2-3), 137–167 (2015)
 (33) Xu, K., Li, C., Tian, Y., Sonobe, T., Kawarabayashi, K.i., Jegelka, S.: Representation learning on graphs with jumping knowledge networks. arXiv preprint arXiv:1806.03536 (2018)
 (34) Zhang, J., Zhang, M., Lu, Z., Xiang, T., Wen, J.: Adargcn: Adaptive aggregation gcn for few-shot learning. arXiv preprint arXiv:2002.12641 (2020)
 (35) Zhu, Y., Liu, C., Jiang, S.: Multi-attention meta learning for few-shot fine-grained image recognition. In: Twenty-Ninth International Joint Conference on Artificial Intelligence and Seventeenth Pacific Rim International Conference on Artificial Intelligence. pp. 1090–1096 (2020). https://doi.org/10.24963/ijcai.2020/152