In this paper, we propose a novel edge-labeling graph neural network (EGNN), which adapts a deep neural network on the edge-labeling graph, for few-shot learning. The previous graph neural network (GNN) approaches in few-shot learning have been based on the node-labeling framework, which implicitly models the intra-cluster similarity and the inter-cluster dissimilarity. In contrast, the proposed EGNN learns to predict the edge-labels rather than the node-labels on the graph that enables the evolution of an explicit clustering by iteratively updating the edge-labels with direct exploitation of both intra-cluster similarity and the inter-cluster dissimilarity. It is also well suited for performing on various numbers of classes without retraining, and can be easily extended to perform a transductive inference. The parameters of the EGNN are learned by episodic training with an edge-labeling loss to obtain a well-generalizable model for unseen low-data problem. On both of the supervised and semi-supervised few-shot image classification tasks with two benchmark datasets, the proposed EGNN significantly improves the performances over the existing GNNs.READ FULL TEXT VIEW PDF
We propose to study the problem of few-shot learning with the prism of
We propose to study the problem of few shot graph classification in grap...
Existing graph-network-based few-shot learning methods obtain similarity...
Interpreting deep neural networks from the ordinary differential equatio...
Few-shot learning can find the latent structure information between the ...
Graph Neural Networks (GNN) has demonstrated the superior performance in...
Node features and structural information of a graph are both crucial for...
A lot of interest in meta-learning  has been recently arisen in various areas including especially task-generalization problems such as few-shot learning [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15], learn-to-learn [16, 17, 18]
, non-stationary reinforcement learning[19, 20, 21], and continual learning [22, 23]
. Among these meta-learning problems, few-shot leaning aims to automatically and efficiently solve new tasks with few labeled data based on knowledge obtained from previous experiences. This is in contrast to traditional (deep) learning methods that highly rely on large amounts of labeled data and cumbersome manual tuning to solve a single task.
Recently, there has also been growing interest in graph neural networks (GNNs) to handle rich relational structures on data with deep neural networks [24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34]. GNNs iteratively perform a feature aggregation from neighbors by message passing, and therefore can express complex interactions among data instances. Since few-shot learning algorithms have shown to require full exploitation of the relationships between a support set and a query [10, 11, 3, 2, 5], the use of GNNs can naturally have the great potential to solve the few-shot learning problem.
proposed to first construct a graph where all examples of the support set and a query are densely connected. Each input node is represented by the embedding feature (e.g. an output of a convolutional neural network) and the given label information (e.g. one-hot encoded label). Then, it classifies the unlabeled query by iteratively updating node features from neighborhood aggregation.Liu et al.  proposed a transductive propagation network (TPN) on the node features obtained from a deep neural network. At test-time, it iteratively propagates one-hot encoded labels over the entire support and query instances as a whole with a common graph parameter set. Here, it is noted that the above previous GNN approaches in few-shot learning have been mainly based on the node-labeling framework, which implicitly models the intra-cluster similarity and inter-cluster dissimilarity.
On the contrary, the edge-labeling framework is able to explicitly perform the clustering with representation learning and metric learning, and thus it is intuitively a more conducive framework for inferring a query association to an existing support clusters. Furthermore, it does not require the pre-specified number of clusters (e.g. class-cardinality or ways) while the node-labeling framework has to separately train the models according to each number of clusters. The explicit utilization of edge-labeling which indicates whether the associated two nodes belong to the same cluster (class) have been previously adapted in the naive (hyper) graphs for correlation clustering  and the GNNs for citation networks or dynamical systems [36, 37], but never applied to a graph for few-shot learning. Therefore, in this paper, we propose an edge-labeling GNN (EGNN) for few-shot leaning, especially on the task of few-shot classification.
The proposed EGNN consists of a number of layers in which each layer is composed of a node-update block and an edge-update block. Specifically, across layers, the EGNN not only updates the node features but also explicitly adjusts the edge features, which reflect the edge-labels of the two connected node pairs and directly exploit both the intra-cluster similarity and inter-cluster dissimilarity. As shown in Figure 1, after a number of alternative node and edge feature updates, the edge-label prediction can be obtained from the final edge feature. The edge loss is then computed to update the parameters of EGNN with a well-known meta-learning strategy, called episodic training [9, 2]. The EGNN is naturally able to perform a transductive inference to predict all test (query) samples at once as a whole, and this has shown more robust predictions in most cases when a few labeled training samples are provided. In addition, the edge-labeling framework in the EGNN enables to handle various numbers of classes without remodeling or retraining. We will show by means of experimental results on two benchmark few-shot image classification datasets that the EGNN outperforms other few-shot learning algorithms including the existing GNNs in both supervised and semi-supervised cases.
Our main contributions can be summarized as follows:
The EGNN is first proposed for few-shot learning with iteratively updating edge-labels with exploitation of both intra-cluster similarity and inter-cluster dissimilarity. It is also able to be well suited for performing on various numbers of classes without retraining.
It consists of a number of layers in which each layer is composed of a node-update block and an edge-update block where the corresponding parameters are estimated under the episodic training framework.
Both of the transductive and non-transductive learning or inference are investigated with the proposed EGNN.
On both of the supervised and semi-supervised few-shot image classification tasks with two benchmark datasets, the proposed EGNN significantly improves the performances over the existing GNNs. Additionally, several ablation experiments show the benefits from the explicit clustering as well as the separate utilization of intra-cluster similarity and inter-cluster dissimilarity.
Graph neural networks were first proposed to directly process graph structured data with neural networks as of form of recurrent neural networks[28, 29]. Li et al. 
further extended it with gated recurrent units and modern optimization techniques. Graph neural networks mainly do representation learning with a neighborhood aggregation framework that the node features are computed by recursively aggregating and transforming features of neighboring nodes. Generalized convolution based propagation rules also have been directly applied to graphs[38, 39, 34], and Kipf and Welling 
especially applied it to semi-supervised learning on graph-structured data with scalability. A few approaches[6, 12] have explored GNNs for few-shot learning and are based on the node-labeling framework.
Correlation clustering (CC) is a graph-partitioning algorithm  that infers the edge labels of the graph by simultaneously maximizing intra-cluster similarity and inter-cluster dissimilarity. Finley and Joachims 
considered a framework that uses structured support vector machine in CC for noun-phrase clustering and news article clustering.Taskar  derived a max-margin formulation for learning the edge scores in CC for producing two different segmentations of a single image. Kim et al.  explored a higher-order CC over a hypergraph for task-specific image segmentation. The attention mechanism in a graph attention network has recently extended to incorporate real-valued edge features that are adaptive to both the local contents and the global layers for modeling citation networks . Kipf et al.  introduced a method to simultaneously infer relational structure with interpretable edge types while learning the dynamical model of an interacting system. Johnson  introduced the Gated Graph Transformer Neural Network (GGT-NN) for natural language tasks, where multiple edge types and several graph transformation operations including node state update, propagation and edge update are considered.
One main stream approach for few-shot image classification is based on representation learning and does prediction by using nearest-neighbor according to similarity between representations. The similarity can be a simple distance function such as cosine or Euclidean distance. A Siamese network  works in a pairwise manner using trainable weighted distance. A matching network  further uses an attention mechanism to derive an differentiable nearest-neighbor classifier and a prototypical network  extends it with defining prototypes as the mean of embedded support examples for each class. DEML  has introduced a concept learner to extract high-level concept by using a large-scale auxiliary labeled dataset showing that a good representation is an important component to improve the performance of few-shot image classification.
A meta-learner that learns to optimize model parameters extract some transferable knowledge between tasks to leverage in the context of few-shot learning. Meta-LSTM  uses LSTM as a model updater and treats the model parameters as its hidden states. This allows to learn the initial values of parameters and update the parameters by reading few-shot examples. MAML  learns only the initial values of parameters and simply uses SGD. It is a model agnostic approach, applicable to both supervised and reinforcement learning tasks. Reptile  is similar to MAML but using only first-order gradients. Another generic meta-learner, SNAIL , is with a novel combination of temporal convolutions and soft attention to learn an optimal learning strategy.
In this section, the definition of few-shot classification task is introduced, and the proposed algorithm is described in detail.
The few-shot classification aims to learn a classifier when only a few training samples per each class are given. Therefore, each few-shot classification task contains a support set , a labeled set of input-label pairs, and a query set , an unlabeled set on which the learned classifier is evaluated. If the support set contains labeled samples for each of unique classes, the problem is called -way -shot classification problem.
Recently, meta-learning has become a standard methodology to tackle few-shot classification. In principle, we can train a classifier to assign a class label to each query sample with only the compact support set of the task. However, a small number of labeled support samples for each task are not sufficient to train a model fully reflecting the inter- and intra-class variations, which often leads to unsatisfactory classification performance. Meta-learning on explicit training set resolves this issue by extracting transferable knowledge that allows us to perform better few-shot learning on the support set, and thus classify the query set more successfully.
As an efficient way of meta-learning, we adopt episodic training [9, 2] which is commonly employed in various literatures [3, 4, 5]. Given a relatively large labeled training dataset, the idea of episodic training is to sample training tasks (episodes) that mimic the few-shot learning setting of test tasks. Here, since the distribution of training tasks is assumed to be similar to that of test tasks, the performances of the test tasks can be improved by learning a model to work well on the training tasks.
More concretely, in episodic training, both training and test tasks of the -way -shot problem are formed as follows: where and . Here, is the number of query samples, and and are the th input data and its label, respectively. is the set of all classes of either training or test dataset. Although both the training and test tasks are sampled from the common task distribution, the label spaces are mutually exclusive, i.e. . The support set in each episode serves as the labeled training set on which the model is trained to minimize the loss of its predictions over the query set . This training procedure is iteratively carried out episode by episode until convergence.
Finally, if some of support samples are unlabeled, the problem is referred to as semi-supervised few-shot classification. In Section 4, the effectiveness of our algorithm on semi-supervised setting will be presented.
This section describes the proposed EGNN for few-shot classification, as illustrated in Figure 2. Given the feature representations (extracted from a jointly trained convolutional neural network) of all samples of the target task, a fully-connected graph is initially constructed where each node represents each sample, and each edge represents the types of relationship between the two connected nodes; Let be the graph constructed with samples from the task , where and denote the set of nodes and edges of the graph, respectively. Let and be the node feature of and the edge feature of , respectively. is the total number of samples in the task . Each ground-truth edge-label is defined by the ground-truth node labels as:
Each edge feature is a 2-dimensional vector representing the (normalized) strengths of the intra- and inter-class relations of the two connected nodes. This allows to separately exploit the intra-cluster similarity and the inter-cluster dissimilairity.
Node features are initialized by the output of the convolutional embedding network , where is the corresponding parameter set (see Figure 3.(a)). Edge features are initialized by edge labels as follows:
where is the concatenation operation.
The EGNN consists of layers to process the graph, and the forward propagation of EGNN for inference is an alternative update of node feature and edge feature through layers.
In detail, given and from the layer , node feature update is firstly conducted by a neighborhood aggregation procedure. The feature node at the layer is updated by first aggregating the features of other nodes proportional to their edge features, and then performing the feature transformation; the edge feature at the layer is used as a degree of contribution of the corresponding neighbor node like an attention mechanism as follows:
where , and
is the feature (node) transformation network, as shown in Figure3.(b), with the parameter set . It should be noted that besides the conventional intra-class aggregation, we additionally consider inter-class aggregation. While the intra-class aggregation provides the target node the information of “similar neighbors”, the inter-class aggregation provides the information of “dissimilar neighbors”.
Then, edge feature update is done based on the newly updated node features. The (dis)similarities between every pair of nodes are re-obtained, and the feature of each edge is updated by combining the previous edge feature value and the updated (dis)similarities such that
where is the metric network that computes similarity scores with the parameter set (see Figure 3.(c)). In specific, the node feature flows into edges, and each element of the edge feature vector is updated separately from each normalized intra-cluster similarity or inter-cluster dissimilarity. Namely, each edge update considers not only the relation of the corresponding pair of nodes but also the relations of the other pairs of nodes. We can optionally use two separate metric networks for the computations of each of similarity or dissimilarity (e.g. separate instead of ).
After number of alternative node and edge feature updates, the edge-label prediction can be obtained from the final edge feature, i.e. . Here,
can be considered as a probability that the two nodesand are from the same class. Therefore, each node can be classified by simple weighted voting with support set labels and edge-label prediction results. The prediction probability of node can be formulated as :
where is the Kronecker delta function that is equal to one when and zero otherwise. Alternative approach for node classification is the use of graph clustering; the entire graph
can be first partitioned into clusters, using the edge prediction and an optimization for valid partitioning via linear programming, and then each cluster can be labeled with the support label it contains the most. However, in this paper, we simply apply Eq. (7) to obtain the classification results. The overall algorithm for the EGNN inference at test-time is summarized in Algorithm 1. The non-transductive inference means the number of query samples = 1 or it performs the query inference one-by-one, separately, while the transductive inference classifies all query samples at once in a single graph.
Given training tasks at a certain iteration during the episodic training, the parameters of the proposed EGNN,
, are trained in an end-to-end fashion by minimizing the following loss function:
where and are the set of all ground-truth query edge-labels and the set of all (real-valued) query-edge predictions of the task at the layer, respectively, and the edge loss is defined as binary cross-entropy loss. Since the edge prediction results can be obtained not only from the last layer but also from the other layers, the total loss combines all losses that are computed in all layers in order to improve the gradient flow in the lower layers.
We evaluated and compared our EGNN 111The code and models are available on https://github.com/khy0809/fewshot-egnn. with state-of-the-art approaches on two few-shot learning benchmarks, i.e. miniImageNet  and tieredImageNet .
It is the most popular few-shot learning benchmark proposed by  derived from the original ILSVRC-12 dataset . All images are RGB colored, and of size 84 84 pixels, sampled from 100 different classes with 600 samples per class. We followed the splits used in  - 64, 16, and 20 classes for training, validation and testing, respectively.
Similar to miniImageNet dataset, tieredImageNet  is also a subset of ILSVRC-12 . Compared with miniImageNet, it has much larger number of images (more than 700K) sampled from larger number of classes (608 classes rather than 100 for miniImageNet). Importantly, different from miniImageNet, tieredImageNet adopts hierarchical category structure where each of 608 classes belongs to one of 34 higher-level categories sampled from the high-level nodes in the Imagenet. Each higher-level category contains 10 to 20 classes, and divided into 20 training (351 classes), 6 validation (97 classes) and 8 test (160 classes) categories. The average number of images in each class is 1281.
For feature embedding module, a convolutional neural network, which consists of four blocks, was utilized as in most few-shot learning models [2, 4, 3, 6] without any skip connections 222Resnet-based models are excluded for fair comparison.. More concretely, each convolutional block consists of 3
3 convolutions, a batch normalization and a LeakyReLU activation. All network architectures used in EGNN are described in details in Figure3.
For both datasets, we conducted a 5-way 5-shot experiment which is one of standard few-shot learning settings. For evaluation, each test episode was formed by randomly sampling 15 queries for each of 5 classes, and the performance is averaged over 600 randomly generated episodes from the test set. Especially, we additionally conducted a more challenging 10-way experiment on miniImagenet, to demonstrate the flexibility of our EGNN model when the number of classes are different between meta-training stage and meta-test stage, which will be presented in Section 4.5.
The proposed model was trained with Adam optimizer with an initial learning rate of and weight decay of . The task mini-batch sizes for meta-training were set to be 40 and 20 for 5-way and 10-way experiments, respectively. For mini
ImageNet, we cut the learning rate in half every 15,000 episodes while for tieredImageNet, the learning rate is halved for every 30,000 because it is larger dataset and requires more iterations to converge. All our code was implemented in Pytorch and run with NVIDIA Tesla P40 GPUs.
The few-shot classification performance of the proposed EGNN model is compared with several state-of-the-art models in Table 0(a) and 0(b). Here, as presented in , all models are grouped into three categories with regard to three different transductive settings; “No” means non-transductive method, where each query sample is predicted independently from other queries, “Yes” means transductive method where all queries are simultaneously processed and predicted together, and “BN” means that query batch statistics are used instead of global batch normalization parameters, which can be considered as a kind of transductive inference at test-time.
The proposed EGNN was tested with both transductive and non-transductive settings. As shown in Table 0(a), EGNN shows the best performance in 5-way 5-shot setting, on both transductive and non-transductive settings on miniImagenet. Notably, EGNN performed better than node-labeling GNN , which supports the effectiveness of our edge-labeling framework for few-shot learning. Moreover, EGNN with transduction (EGNN + Transduction) outperformed the second best method (TPN ) on both datasets, especially by large margin on miniImagenet. Table 0(b) shows that the transductive setting on tieredImagenet gave the best performance as well as large improvement compared to the non-transductive setting. In TPN, only the labels of the support set are propagated to the queries based on the pairwise node feature affinities using a common Laplacian matrix, so the queries communicate to each other only via their embedding feature similarities. In contrast, our proposed EGNN allows us to consider more complicated interactions between query samples, by propagating to each other not only their node features but also edge-label information across the graph layers having different parameter sets. Furthermore, the node features of TPN are fixed and never changed during label propagation, which allows them to derive a closed-form, one-step label propagation equation. On the contrary, in our EGNN, both node and edge features are dynamically changed and adapted to the given task gradually with several update steps.
For semi-supervised experiment, we followed the same setting described in  for fair comparison. It is a 5-way 5-shot setting, but the support samples are only partially labeled. The labeled samples are balanced among classes so that all classes have the same amount of labeled and unlabeled samples. The obtained results on miniImagenet are presented in Table 2. Here, “LabeledOnly” denotes learning with only labeled support samples, and “Semi” means the semi-supervised setting explained above. Different results are presented according to when 20% and 40%, 60% of support samples were labeled, and the proposed EGNN is compared with node-labeling GNN . As shown in Table 2, semi-supervised learning increases the performances in comparison to labeled-only learning on all cases. Notably, the EGNN outperformed the previous GNN  by a large margin (61.88% vs 52.45%, when 20% labeled) on semi-supervised learning, especially when the labeled portion was small. The performance is even more increased on transductive setting (EGNN-Semi(T)). In a nutshell, our EGNN is able to extract more useful information from unlabeled samples compared to node-labeling framework, on both transductive and non-transductive settings.
|Labeled Ratio (5-way 5-shot)|
|# of EGNN layers|
|Intra & Inter||67.99||73.19||76.37|
The proposed edge-labeling GNN has a deep architecture that consists of several node and edge-update layers. Therefore, as the model gets deeper with more layers, the interactions between task samples should be propagated more intensively, which may leads to performance improvements. To support this statement, we compared the few-shot learning performances with different numbers of EGNN layers, and the results are presented in Table 3. As the number of EGNN layers increases, the performance gets better. There exists a big jump on few-shot accuracy when the number of layers changes from 1 to 2 (67.99% 73.19%), and a little additional gain with three layers (76.37 %).
Another key ingredient of the proposed EGNN is to use separate exploitation of intra-cluster similarity and inter-cluster dissimilarity in node/edge updates. To validate the effectiveness of this, we conducted experiment with only intra-cluster aggregation and compared the results with those obtained by using both aggregations. The results are also presented in Table 3. For all EGNN layers, the use of separate inter-cluster aggregation clearly improves the performances.
It should also be noted that compared to the previous node-labeling GNN, the proposed edge-labeling framework is more conducive in solving the few-shot problem under arbitrary meta-test setting, especially when the number of few-shot classes for meta-testing does not match to the one used for meta-training. To validate this statement, we conducted a cross-way experiment with EGNN, and the result is presented in Table 4. Here, the model was trained with 5-way 5-shot setting and tested on 10-way 5-shot setting, and vice versa. Interestingly, both cross-way results are similar to those obtained with the matched-way settings. Therefore, we can observe that the EGNN can be successfully extended to modified few-shot setting without re-training of the model, while the previous node-labeling GNN  is not even applicable to cross-way setting, since the size of the model and parameters are dependent on the number of ways.
|Model||Train way||Test way||Accuracy|
Figure 4 shows t-SNE  visualizations of node features for the previous node-labeling GNN and EGNN. The GNN tends to show a good clustering among support samples after the first layer-propagation, however, query samples are heavily clustered together, and according to each label, query samples and their support samples never get close together, especially even with more layer-propagations, which means that the last fully-connect layer of GNN actually seems to perform most roles in query classification. In contrast, in our EGNN, as the layer-propagation goes on, both the query and support samples are pulled away if their labels are different, and at the same time, equally labeled query and support samples get close together.
For further analysis, Figure 5 shows how edge features propagate in EGNN. Starting from the initial feature where all query edges are initialized with 0.5, the edge feature gradually evolves to resemble ground-truth edge label, as they are passes through the several EGNN layers.
This work addressed the problem of few-shot learning, especially on the few-shot classification task. We proposed the novel EGNN which aims to iteratively update edge-labels for inferring a query association to an existing support clusters. In the process of EGNN, a number of alternative node and edge feature updates were performed using explicit intra-cluster similarity and inter-cluster dissimilarity through the graph layers having different parameter sets, and the edge-label prediction was obtained from the final edge feature. The edge-labeling loss was used to update the parameters of the EGNN with episodic training. Experimental results showed that the proposed EGNN outperformed other few-shot learning algorithms on both of the supervised and semi-supervised few-shot image classification tasks. The proposed framework is applicable to a broad variety of other meta-clustering tasks. For future work, we can consider another training loss which is related to the valid graph clustering such as the cycle loss . Another promising direction is graph sparsification, e.g. constructing -nearest neighbor graphs , that will make our algorithm more scalable to larger number of shots.
This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT)(No. NRF-2017R1A2B2006165) and Institute for Information & communications Technology Promotion(IITP) grant funded by the Korea government(MSIT) (No.2016-0-00563, Research on Adaptive Machine Learning Technology Development for Intelligent Autonomous Digital Companion). Also, we thank the Kakao Brain Cloud team for supporting to efficiently use GPU clusters for large-scale experiments.
International Journal of Computer Vision, 115(3):211–252, 2015.