1 Introduction
The success of deep learning is rooted in a large amount of labeled data
[19, 38], while humans generalize well after having seen only a few examples. The contrast between these two facts has brought great attention to the research of few-shot learning [7, 20]. A few-shot learning task aims at predicting unlabeled data (the query set) given only a few labeled data (the support set). Fine-tuning [4] is the de facto method for obtaining a predictive model from a small training dataset in practice nowadays. However, it suffers from overfitting issues [11]. Meta-learning [8] methods introduce the concept of an episode to address the few-shot problem explicitly. An episode is one round of model training, where in each episode only a few examples (e.g., 1 or 5) are randomly sampled from each class in the training data. Meta-learning methods adopt a trainer (also called a meta-learner) which takes the few-shot training data and outputs a classifier. This process is called episodic training [41]. Under the meta-learning framework, diverse hypotheses have been proposed to build an efficient meta-learner.
A rising trend in recent research is to process the training data with Graph Networks [2], a powerful model that generalizes many data structures (lists, trees) while introducing a combinatorial prior over the data. Few-Shot GNN [10] builds a complete graph network where each node feature is concatenated with the corresponding class label; node features are then updated via the attention mechanism of the graph network to propagate the label information. To further exploit intra-cluster similarity and inter-cluster dissimilarity in the graph-based network, EGNN [18] demonstrates an edge-labeling graph neural network under the episodic training framework. Previous GNN studies in few-shot learning mainly focused on pairwise relations such as node labeling or edge labeling, and ignored a large number of substantial distribution-level relations. Other meta-learning approaches claim to exploit global relations through episodic training, but only in an implicit way.

As illustrated in Figure 1, we first extract the instance features of support and query samples. Then, we obtain a distribution feature for each sample by calculating the instance-level similarity over all support samples. To leverage both the instance-level and distribution-level representations of each example and process the two levels independently, we propose a dual-graph architecture, DPGN (Distribution Propagation Graph Network): a point graph (PG) and a distribution graph (DG). Specifically, the PG generates the DG by gathering 1-vs-N relations for every example, while the DG refines the PG by delivering distribution relations between each pair of examples. This cyclic transformation adequately fuses instance-level and distribution-level relations, and multiple generations (rounds) of this Gather-Compare process constitute our approach. Furthermore, it is easy to extend DPGN to the semi-supervised few-shot learning task, where the support set contains both labeled and unlabeled samples for each class. DPGN builds a bridge between labeled and unlabeled samples in the form of similarity distributions, which leads to better propagation of label information in semi-supervised few-shot classification.
Our main contributions are summarized as follows:

To the best of our knowledge, DPGN is the first to explicitly incorporate distribution propagation in a graph network for few-shot learning. Ablation studies demonstrate the effectiveness of distribution relations.

We devise a dual complete graph network that combines instance-level and distribution-level relations. The cyclic update policy in this framework enhances instance features with distribution information.

Extensive experiments are conducted on four popular benchmark datasets for few-shot learning. Compared with all state-of-the-art methods, DPGN achieves a significant improvement of 5%–12% on average in few-shot classification accuracy. In semi-supervised tasks, our algorithm outperforms existing graph-based few-shot learning methods by 7%–13%.
2 Related Work
2.1 Graph Neural Network
Graph neural networks were first designed for tasks that process graph-structured data [34, 41]. They mainly refine node representations by recursively aggregating and transforming neighboring nodes. Recent approaches [10, 25, 18] exploit GNNs in the field of few-shot learning. TPN [25] brings the transductive setting into graph-based few-shot learning, constructing a Laplacian matrix to propagate labels from the support set to the query set in the graph. It also considers the similarity between support and query samples through pairwise node-feature affinities. EGNN [18] uses the similarity/dissimilarity between samples and dynamically updates both node and edge features to model complicated interactions.
2.2 Metric Learning
Another category of few-shot learning approaches focuses on optimizing feature embeddings of the input data using metric learning methods. Matching Networks [41] produce a weighted nearest-neighbor classifier by computing embedding distances between the support and query sets. Prototypical Networks [36] build a prototype representation of each class in the embedding space. As an extension of Prototypical Networks, IMP [1] constructs infinite mixture prototypes by self-adaptation. RelationNet [40] adopts a distance metric network to learn pointwise relations between support and query samples.
2.3 Distribution Learning
Distribution learning theory was first introduced in [17] to find efficient algorithms that determine the distribution from which samples are drawn. Various methods [16, 5, 6] have been proposed to efficiently estimate target distributions. DLDL [9] assigns a discrete label distribution instead of a one-hot label to each instance in classification and regression tasks. CPNN [44] takes both features and labels as inputs and produces the label distribution with only one hidden layer in its framework. LDLFs [35] devises a distribution learning method based on the decision tree algorithm.
2.4 Meta Learning
Some few-shot approaches adopt a meta-learning framework that learns meta-level knowledge across batches of tasks. MAML [8] is a gradient-based approach that designs the meta-learner as an optimizer which learns to update model parameters (e.g., all layers of a deep network) within a few optimization steps given novel examples. Reptile [28] simplifies the computation of the meta-loss by incorporating an L2 loss which updates the meta-model parameters towards the instance-specific adapted models. SNAIL [27] learns a parameterized predictor to estimate the parameters of models. MetaOptNet [21] advocates the use of a linear classifier instead of nearest-neighbor methods, which can be optimized as a convex learning problem. LEO [33] utilizes an encoder-decoder architecture to mine latent generative representations and predicts high-dimensional parameters in extremely low-data regimes.
3 Method
In this section, we first provide the background of the few-shot learning task, then introduce the proposed algorithm in detail.
3.1 Problem Definition
The goal of few-shot learning is to train a model that performs well when only a few labeled samples are given.
Each few-shot task has a support set S and a query set Q. Given the training data, the support set S contains N classes with K samples for each class (i.e., the N-way K-shot setting). The query set Q contains the samples to be classified. In the training stage, labels are provided for both the support set and the query set. Given testing data, our goal is to train a classifier that accurately maps each query sample to its corresponding label using only a few support samples. The class sets of the training data and the testing data are mutually exclusive.
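The N-way K-shot episode construction described above can be sketched in a few lines; the data layout (a mapping from class to examples) and function names are illustrative, not from the paper:

```python
import random

def sample_episode(data_by_class, n_way, k_shot, n_query, rng=random):
    """Sample one N-way K-shot episode: n_way classes, k_shot labeled
    support examples and n_query query examples per class, with support
    and query instances disjoint within each class."""
    classes = rng.sample(sorted(data_by_class), n_way)
    support, query = [], []
    for label, cls in enumerate(classes):
        picked = rng.sample(data_by_class[cls], k_shot + n_query)
        support += [(x, label) for x in picked[:k_shot]]
        query += [(x, label) for x in picked[k_shot:]]
    return support, query
```

Class labels are re-indexed to 0..N-1 per episode, mirroring how episodic training presents each task as a fresh classification problem.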
3.2 Distribution Propagation Graph Networks
In this section, we explain the proposed DPGN for few-shot learning in detail. As shown in Figure 2, the DPGN consists of several generations, and each generation consists of a point graph and a distribution graph. First, the feature embeddings of all samples are extracted by a convolutional backbone and used to compute instance similarities. Second, these instance relations are used to construct the distribution graph: its node features are initialized by gathering each sample's instance similarities toward the support samples, and its edge features stand for the similarities between these distribution features. Finally, the distribution graph is delivered back to the point graph to construct more discriminative node representations, and we repeat the above procedure generation by generation.

Throughout, T denotes the total number of examples in a training episode. The point-graph node features are first initialized by the output of the feature extractor. For each sample:
(1) 
where m denotes the dimension of the feature embedding.
3.2.1 Point-to-Distribution Aggregation
Point Similarity
Each edge in the point graph stands for an instance (point) similarity, and the edges of the first generation are initialized as follows:
(2) 
where the encoding network transforms the instance similarity to a certain scale; it contains two Conv-BN-ReLU [13, 15] blocks and a sigmoid layer.
For later generations, given the updated node features and the edge features of the previous generation, the point-graph edges are updated as follows:
(3) 
To use the edge information with a holistic view of the graph, a normalization operation is conducted on the point-graph edges.
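As a rough sketch of the edge initialization and normalization (not the paper's exact encoder: a single linear layer plus sigmoid stands in for the two Conv-BN-ReLU blocks, and a simple row normalization is assumed):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_point_edges(v, w):
    """Generation-0 point-graph edges: encode the element-wise squared
    difference of every node pair with a tiny linear layer + sigmoid
    (standing in for the paper's Conv-BN-ReLU encoder), then
    row-normalize so each node's outgoing edges form a distribution."""
    t = len(v)
    e = np.empty((t, t))
    for i in range(t):
        for j in range(t):
            e[i, j] = sigmoid(float((v[i] - v[j]) ** 2 @ w))
    return e / e.sum(axis=1, keepdims=True)
```

Here `v` holds the T instance embeddings and `w` is an illustrative weight vector; in the real model the encoder's parameters are learned end-to-end.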
P2D Aggregation
After the edge features in the point graph are produced or updated, the distribution graph is constructed next. As shown in Figure 3, the distribution graph integrates instance relations from the point graph and processes distribution-level relations. Each distribution feature is a vector with one dimension per support sample, whose entries represent the relations between the given sample and each support sample. For the first initialization:
(4) 
where the concatenation operator stacks these entries over the support set and the Kronecker delta function outputs one when two samples share a label and zero otherwise.
For later generations, the distribution nodes are updated as follows:
(5) 
where the aggregation network for the distribution graph first concatenates the two input features, then transforms the concatenated feature with a fully-connected layer and ReLU [13].
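A minimal sketch of the distribution-node initialization and the P2D update follows. The Kronecker-delta initialization for support samples comes from the text; initializing query rows to a uniform distribution, and the single linear layer standing in for the fully-connected transform, are our assumptions:

```python
import numpy as np

def init_distribution_nodes(labels, n_support):
    """Generation-0 distribution features: for support sample i, the j-th
    entry is the Kronecker delta on the labels of samples i and j; query
    samples start from a uniform distribution over the support set
    (the uniform choice for queries is an assumption of this sketch)."""
    t = len(labels)
    d = np.full((t, n_support), 1.0 / n_support)
    for i in range(n_support):
        d[i] = [1.0 if labels[i] == labels[j] else 0.0 for j in range(n_support)]
    return d

def p2d_update(point_edges, prev_d, w):
    """P2D aggregation: concatenate each node's point-edge row toward the
    support set with its previous distribution feature, then apply a
    fully-connected layer + ReLU (the weight matrix w is illustrative)."""
    n_support = prev_d.shape[1]
    x = np.concatenate([point_edges[:, :n_support], prev_d], axis=1)
    return np.maximum(x @ w, 0.0)
```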
3.2.2 Distribution-to-Point Aggregation
Distribution Similarity
Each edge in the distribution graph stands for the similarity between the distribution features of different samples. For the first generation, the distribution similarity is initialized as follows:
(6) 
where the encoding network transforms the distribution similarity using two Conv-BN-ReLU blocks followed by a sigmoid layer. For later generations, the update rule for the distribution-graph edges is formulated as follows:
(7) 
As in the point graph, we also apply a normalization to the distribution-graph edges.
D2P Aggregation
As illustrated in Figure 3, the encoded distribution information flows back into the point graph at the end of each generation. The node features in the point graph then capture the distribution relations by aggregating all node features with the distribution-graph edge features as follows:
(8) 
where the aggregation network for the point graph concatenates the aggregated feature with the node features of the previous generation and updates the result with two Conv-BN-ReLU blocks. After this process, the node features integrate distribution-level information into the instance-level features, in preparation for computing instance similarities in the next generation.
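The D2P step above can be sketched as follows; a linear layer + ReLU stands in for the paper's two Conv-BN-ReLU blocks, and the weight matrix is illustrative:

```python
import numpy as np

def d2p_update(dist_edges, prev_v, w):
    """D2P aggregation: every point node gathers all point features
    weighted by the distribution-graph edges, concatenates the result
    with its own previous feature, and transforms it (a linear layer +
    ReLU standing in for the learned aggregation network)."""
    gathered = dist_edges @ prev_v            # (T, m): weighted sum over nodes
    x = np.concatenate([gathered, prev_v], axis=1)
    return np.maximum(x @ w, 0.0)             # refined point-node features
```

One full generation of DPGN is then: point edges → P2D → distribution edges → D2P → refined point nodes, repeated for a fixed number of generations.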
3.3 Objective
The class prediction of each node can be computed by feeding the corresponding edges in the final generation of DPGN into a softmax function:
(9) 
where the output is the probability distribution over classes for a given sample, computed from the point-graph edge features at the final generation together with the labels of the support samples.

Point Loss
We make classification predictions in the point graph for each sample. Therefore, the point loss at each generation is defined as follows:
(10) 
where the cross-entropy loss is computed between the model's probability predictions and the ground-truth labels of the samples in each task.

Distribution Loss
To facilitate the training process and learn discriminative distribution features, we incorporate a distribution loss, which contributes to faster and better convergence. We define the distribution loss for each generation as follows:
(11) 
where the loss is computed over the edge features in the distribution graph at that generation.
The total objective function is a weighted summation of all the losses mentioned above:
(12) 
where the summation runs over all generations of the DPGN, and the weights of the point and distribution losses are set to balance their importance. In most of our experiments, they are set to 1.0 and 0.1 respectively.
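As a toy sketch of the objective pipeline, the softmax-over-edges readout and the weighted loss sum might look like the following (array shapes and helper names are ours, not the paper's):

```python
import numpy as np

def predict(final_edges, support_labels, n_class):
    """Class probabilities for each node: sum its final point-graph edges
    toward the support samples of every class, then apply a softmax."""
    scores = np.zeros((final_edges.shape[0], n_class))
    for j, y in enumerate(support_labels):
        scores[:, y] += final_edges[:, j]
    z = np.exp(scores - scores.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def total_loss(point_losses, dist_losses, lam_p=1.0, lam_d=0.1):
    """Weighted sum of the per-generation point and distribution losses,
    using the weights 1.0 and 0.1 reported in the paper as defaults."""
    return sum(lam_p * lp + lam_d * ld
               for lp, ld in zip(point_losses, dist_losses))
```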
4 Experiments
4.1 Datasets and Setups
4.1.1 Datasets
We evaluate DPGN on four standard few-shot learning benchmarks: miniImageNet [41], tieredImageNet [31], CUB-200-2011 [42] and CIFAR-FS [3]. miniImageNet and tieredImageNet are subsets of ImageNet [32]. CUB-200-2011 was initially designed for fine-grained classification, and CIFAR-FS is a subset of CIFAR-100 for few-shot classification. As shown in Table 1, we list the number of images, number of classes, image resolution, and train/val/test splits, following the criteria of previous works [41, 31, 4, 3].
Dataset  Images  Classes  Train/val/test  Resolution

miniImageNet  60,000  100  64/16/20  84×84
tieredImageNet  779,165  608  351/97/160  84×84
CUB-200-2011  11,788  200  100/50/50  84×84
CIFAR-FS  60,000  100  64/16/20  32×32
4.1.2 Experiment Setups
Network Architecture
We use four popular backbones for fair comparison: ConvNet, ResNet12, ResNet18 and WRN, which are used in EGNN [18], MetaOptNet [21], CloserLook [4] and LEO [33] respectively. ConvNet mainly consists of four Conv-BN-ReLU blocks; the last two blocks also contain a dropout layer [37]. ResNet12 and ResNet18 are the same as those described in [14]; they mainly have four blocks, which include one residual block each for ResNet12 and two residual blocks each for ResNet18. WRN was first proposed in [46]; it mainly has three residual blocks, and the depth of the network is set to 28 as in [33]. The final features of all backbone networks are processed by global average pooling, followed by a fully-connected layer with batch normalization [15], to obtain a 128-dimensional instance embedding.

Training Schema
We perform data augmentation before training, such as horizontal flip, random crop, and color jitter (brightness, contrast, and saturation), as in [11, 43]. We randomly sample 28 meta-task episodes in each iteration for meta-training. The Adam optimizer is used in all experiments. We decay the learning rate by a factor of 0.1 every 15,000 iterations and apply weight decay.
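The step decay above can be written as a small schedule function (the initial learning rate itself is not specified here, so it is left as a parameter):

```python
def learning_rate(base_lr, iteration):
    """Step schedule from the training setup: decay the learning rate by
    a factor of 0.1 every 15,000 iterations; base_lr is the (unspecified)
    initial learning rate."""
    return base_lr * 0.1 ** (iteration // 15000)
```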
Evaluation Protocols
We evaluate DPGN in 5-way-1-shot/5-shot settings on the standard few-shot learning datasets miniImageNet, tieredImageNet, CUB-200-2011 and CIFAR-FS. We follow the evaluation process of previous approaches [18, 33, 43]: we randomly sample 10,000 tasks and report the mean accuracy (in %) with the 95% confidence interval.
4.2 Experiment Results
Main Results
We compare the performance of DPGN with several state-of-the-art models, including graph and non-graph methods. For fair comparison, we evaluate DPGN on the miniImageNet, tieredImageNet, CIFAR-FS and CUB-200-2011 datasets against other methods using the same backbones. As shown in Tables 2, 3 and 4, the proposed DPGN is superior to existing methods and achieves state-of-the-art performance, especially compared with graph-based methods.
Method  Backbone  5-way-1-shot  5-way-5-shot

MatchingNet [41]  ConvNet  43.56±0.84  55.31±0.73
ProtoNet [36]  ConvNet  49.42±0.78  68.20±0.66
RelationNet [40]  ConvNet  50.44±0.82  65.32±0.70
R2D2 [3]  ConvNet  51.20±0.60  68.20±0.60
MAML [8]  ConvNet  48.70±1.84  55.31±0.73
Dynamic [11]  ConvNet  56.20±0.86  71.94±0.57
GNN [10]  ConvNet  50.33±0.36  66.41±0.63
TPN [25]  ConvNet  55.51±0.86  69.86±0.65
Global [26]  ConvNet  53.21±0.40  72.34±0.32
Edge-label [18]  ConvNet  59.63±0.52  76.34±0.48
DPGN  ConvNet  66.01±0.36  82.83±0.41
LEO [33]  WRN  61.76±0.08  77.59±0.12
wDAE [12]  WRN  61.07±0.15  76.75±0.11
DPGN  WRN  67.24±0.51  83.72±0.44
CloserLook [4]  ResNet18  51.75±0.80  74.27±0.63
CTM [22]  ResNet18  62.05±0.55  78.63±0.06
DPGN  ResNet18  66.63±0.51  84.07±0.42
MetaGAN [47]  ResNet12  52.71±0.64  68.63±0.67
SNAIL [27]  ResNet12  55.71±0.99  68.88±0.92
TADAM [29]  ResNet12  58.50±0.30  76.70±0.30
ShotFree [30]  ResNet12  59.04±0.43  77.64
MetaTransfer [39]  ResNet12  61.20±1.80  75.53±0.80
FEAT [43]  ResNet12  62.96±0.02  78.49±0.02
TapNet [45]  ResNet12  61.65±0.15  76.36±0.10
Dense [24]  ResNet12  62.53±0.19  78.95±0.13
MetaOptNet [21]  ResNet12  62.64±0.61  78.63±0.46
DPGN  ResNet12  67.77±0.32  84.60±0.43
Method  Backbone  5-way-1-shot  5-way-5-shot

MAML* [8]  ConvNet  51.67±1.81  70.30±1.75
ProtoNet* [36]  ConvNet  53.34±0.89  72.69±0.74
RelationNet* [40]  ConvNet  54.48±0.93  71.32±0.78
TPN [25]  ConvNet  59.91±0.94  73.30±0.75
Edge-label [18]  ConvNet  63.52±0.52  80.24±0.49
DPGN  ConvNet  69.43±0.49  85.92±0.42
CTM [22]  ResNet18  64.78±0.11  81.05±0.52
DPGN  ResNet18  70.46±0.52  86.44±0.41
TapNet [45]  ResNet12  63.08±0.15  80.26±0.12
MetaTransfer [39]  ResNet12  65.62±1.80  80.61±0.90
MetaOptNet [21]  ResNet12  65.81±0.74  81.75±0.53
ShotFree [30]  ResNet12  66.87±0.43  82.64±0.39
DPGN  ResNet12  72.45±0.51  87.24±0.39
Method  Backbone  CUB-200-2011

  5-way-1-shot  5-way-5-shot
ProtoNet* [36]  ConvNet  51.31±0.91  70.77±0.69
MAML* [8]  ConvNet  55.92±0.95  72.09±0.76
MatchingNet* [41]  ConvNet  61.16±0.89  72.86±0.70
RelationNet* [40]  ConvNet  62.45±0.98  76.11±0.69
CloserLook [4]  ConvNet  60.53±0.83  79.34±0.61
DN4 [23]  ConvNet  53.15±0.84  81.90±0.60
DPGN  ConvNet  76.05±0.51  89.08±0.38
FEAT [43]  ResNet12  68.87±0.22  82.90±0.15
DPGN  ResNet12  75.71±0.47  91.48±0.33
Method  Backbone  CIFAR-FS

  5-way-1-shot  5-way-5-shot
ProtoNet* [36]  ConvNet  55.5±0.7  72.0±0.6
MAML* [8]  ConvNet  58.9±1.9  71.5±1.0
RelationNet* [40]  ConvNet  55.0±1.0  69.3±0.8
R2D2 [3]  ConvNet  65.3±0.2  79.4±0.1
DPGN  ConvNet  76.4±0.5  88.4±0.4
ShotFree [30]  ResNet12  69.2±0.4  84.7±0.4
MetaOptNet [21]  ResNet12  72.0±0.7  84.2±0.5
DPGN  ResNet12  77.9±0.5  90.2±0.4
Semi-supervised Few-shot Learning
We also apply DPGN to semi-supervised few-shot learning. Following [25, 18], we use the same criteria to split the miniImageNet dataset into labeled and unlabeled parts with different ratios. For a 20%-labeled semi-supervised scenario, we split the support samples with a ratio of 0.2/0.8 between labeled and unlabeled data in each class. In semi-supervised few-shot learning, DPGN uses unlabeled support samples to explicitly construct similarity distributions over all other samples; these distributions act as a connection between queries and labeled support samples, propagating label information from labeled samples to queries sufficiently.
Method  Transduction  5-way-5-shot

Reptile [28]  No  62.74
GNN [10]  No  66.41
Edge-label [18]  No  66.85
DPGN  No  72.83
MAML [8]  BN  63.11
Reptile [28]  BN  65.99
RelationNet [40]  BN  67.07
MAML [8]  Yes  66.19
TPN [25]  Yes  69.86
Edge-label [18]  Yes  76.37
DPGN  Yes  84.62
In Figure 4, DPGN shows its superiority over existing semi-supervised few-shot methods, and the results demonstrate the effectiveness of exploiting the relations between labeled and unlabeled data as the label ratio decreases. Notably, DPGN surpasses TPN [25] and EGNN [18] by 11%–16% and 7%–13% respectively in average few-shot classification accuracy on miniImageNet.
Transductive Propagation
To validate the effectiveness of the transductive setting in our framework, we conduct transductive and non-transductive experiments on the miniImageNet dataset in the 5-way-5-shot setting. Table 5 shows that the accuracy of DPGN increases by a large margin in the transductive setting compared with the non-transductive setting. Compared to TPN and EGNN, which consider instance-level features only, DPGN utilizes distribution similarities between query samples and adopts the dual-graph architecture to propagate label information sufficiently.
High-way classification
Furthermore, the performance of DPGN in high-way few-shot scenarios is evaluated on the miniImageNet dataset, with results shown in Figure 5. DPGN not only exceeds the powerful graph-based methods [25, 18] but also surpasses the state-of-the-art non-graph methods significantly. As the number of ways in few-shot tasks increases, the horizons of distribution utilization broaden, allowing DPGN to collect more abundant distribution-level information for queries.
4.3 Ablation Studies
Impact of Distribution Graph
The distribution graph is an important component of DPGN, propagating distribution information, so it is necessary to investigate its effectiveness quantitatively. We design the experiment by limiting the distribution similarities that flow into the point graph for aggregation in each generation during inference. Specifically, we mask out the distribution-graph edge features by keeping a different number of feature dimensions and setting the remaining dimensions to zero, since zero contributes nothing. Figure 6 shows the results for 5-way-1-shot on miniImageNet. Test accuracy and the number of kept feature dimensions are positively correlated, and the accuracy increment (area in blue) shrinks as more dimensions are kept. Across the range of kept dimensions, DPGN boosts performance by nearly 10% in absolute value, showing that the distribution graph has a great impact on our framework.
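The masking probe used in this ablation amounts to zeroing the trailing feature dimensions of the distribution-graph edge features; the helper below is an illustrative sketch, not the experiment code:

```python
import numpy as np

def mask_edge_features(edge_features, keep_dims):
    """Ablation probe: keep only the first keep_dims feature dimensions
    of the distribution-graph edge features and zero out the rest, so
    the masked entries contribute nothing during aggregation."""
    masked = np.array(edge_features, copy=True)
    masked[..., keep_dims:] = 0.0
    return masked
```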
Generation Numbers
DPGN has a cyclic architecture that includes the point graph and the distribution graph, each with node-update and edge-update modules. The total number of generations is an important ingredient of DPGN, so we measure test accuracy with different generation numbers on miniImageNet, tieredImageNet, CUB-200-2011, and CIFAR-FS. In Figure 7, as the generation number changes from 0 to 1, test accuracy rises significantly. From 1 to 10, test accuracy increases by a small margin and the curve begins to fluctuate in the last several generations. Since more generations require more iterations to converge, we choose 6 generations as a trade-off between test accuracy and convergence time. Additionally, to visualize the cyclic update procedure, we choose a test scenario where the ground-truth classes of five query images are [1, 2, 3, 4, 5] and visualize the instance-level similarities used for predicting the five query samples, as shown in Figure 8. The heatmap shows that DPGN refines the instance-level similarity matrix over several generations and makes the right predictions for all five query samples in the final generation. Notably, DPGN not only predicts more accurately but also enlarges the similarity distances between samples of different classes by making instance features more discriminative, which cleans up the prediction heatmap.
5 Conclusion
In this paper, we have presented the Distribution Propagation Graph Network (DPGN) for few-shot learning, a dual complete graph network that combines instance-level and distribution-level relations in an explicit way, equipped with label propagation and transduction. The point and distribution losses are used to jointly update the parameters of the DPGN with episodic training. Extensive experiments demonstrate that our method outperforms recent state-of-the-art algorithms by 5%–12% in the supervised task and 7%–13% in the semi-supervised task on few-shot learning benchmarks. For future work, we aim to focus on high-order message propagation, encoding more complicated information linked with task-level relations.
6 Acknowledgement
This research was supported by National Key R&D Program of China (No. 2017YFA0700800).
References
 [1] Kelsey R Allen, Evan Shelhamer, Hanul Shin, and Joshua B Tenenbaum. Infinite mixture prototypes for fewshot learning. arXiv preprint arXiv:1902.04552, 2019.
 [2] Peter W Battaglia, Jessica B Hamrick, Victor Bapst, Alvaro SanchezGonzalez, Vinicius Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, et al. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261, 2018.
 [3] Luca Bertinetto, Joao F. Henriques, Philip Torr, and Andrea Vedaldi. Metalearning with differentiable closedform solvers. In International Conference on Learning Representations, 2019.
 [4] WeiYu Chen, YenCheng Liu, Zsolt Kira, YuChiang Wang, and JiaBin Huang. A closer look at fewshot classification. In International Conference on Learning Representations, 2019.
 [5] Sanjoy Dasgupta. Learning mixtures of gaussians. In 40th Annual Symposium on Foundations of Computer Science (Cat. No. 99CB37039), pages 634–644. IEEE, 1999.

 [6] Constantinos Daskalakis, Ilias Diakonikolas, and Rocco A Servedio. Learning poisson binomial distributions. Algorithmica, 72(1):316–357, 2015.
 [7] Li Fei-Fei, Rob Fergus, and Pietro Perona. One-shot learning of object categories. IEEE transactions on pattern analysis and machine intelligence, 28(4):594–611, 2006.

 [8] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1126–1135. JMLR.org, 2017.
 [9] Bin-Bin Gao, Chao Xing, Chen-Wei Xie, Jianxin Wu, and Xin Geng. Deep label distribution learning with label ambiguity. IEEE Transactions on Image Processing, 26(6):2825–2838, 2017.
 [10] Victor Garcia and Joan Bruna. Fewshot learning with graph neural networks. arXiv preprint arXiv:1711.04043, 2017.

 [11] Spyros Gidaris and Nikos Komodakis. Dynamic few-shot visual learning without forgetting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4367–4375, 2018.
 [12] Spyros Gidaris and Nikos Komodakis. Generating classification weights with gnn denoising autoencoders for few-shot learning. arXiv preprint arXiv:1905.01102, 2019.
 [13] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural networks. pages 315–323, 2011.
 [14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
 [15] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.

 [16] Adam Tauman Kalai, Ankur Moitra, and Gregory Valiant. Efficiently learning mixtures of two gaussians. In Proceedings of the forty-second ACM symposium on Theory of computing, pages 553–562. ACM, 2010.
 [17] Michael Kearns, Yishay Mansour, Dana Ron, Ronitt Rubinfeld, Robert E Schapire, and Linda Sellie. On the learnability of discrete distributions. In STOC, volume 94, pages 273–282. Citeseer, 1994.
 [18] Jongmin Kim, Taesup Kim, Sungwoong Kim, and Chang D Yoo. Edgelabeling graph neural network for fewshot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 11–20, 2019.

 [19] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
 [20] Brenden Lake, Ruslan Salakhutdinov, Jason Gross, and Joshua Tenenbaum. One shot learning of simple visual concepts. In Proceedings of the annual meeting of the cognitive science society, volume 33, 2011.
 [21] Kwonjoon Lee, Subhransu Maji, Avinash Ravichandran, and Stefano Soatto. Metalearning with differentiable convex optimization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 10657–10665, 2019.
 [22] Hongyang Li, David Eigen, Samuel Dodge, Matthew Zeiler, and Xiaogang Wang. Finding taskrelevant features for fewshot learning by category traversal. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–10, 2019.
 [23] Wenbin Li, Lei Wang, Jinglin Xu, Jing Huo, Yang Gao, and Jiebo Luo. Revisiting local descriptor based imagetoclass measure for fewshot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7260–7268, 2019.
 [24] Yann Lifchitz, Yannis Avrithis, Sylvaine Picard, and Andrei Bursuc. Dense classification and implanting for fewshot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9258–9267, 2019.
 [25] Yanbin Liu, Juho Lee, Minseop Park, Saehoon Kim, Eunho Yang, Sung Ju Hwang, and Yi Yang. Learning to propagate labels: Transductive propagation network for fewshot learning. arXiv preprint arXiv:1805.10002, 2018.
 [26] Tiange Luo, Aoxue Li, Tao Xiang, Weiran Huang, and Liwei Wang. Fewshot learning with global class representations. arXiv preprint arXiv:1908.05257, 2019.
 [27] Nikhil Mishra, Mostafa Rohaninejad, Xi Chen, and Pieter Abbeel. A simple neural attentive metalearner. arXiv preprint arXiv:1707.03141, 2017.
 [28] Alex Nichol, Joshua Achiam, and John Schulman. On firstorder metalearning algorithms. arXiv preprint arXiv:1803.02999, 2018.
 [29] Boris Oreshkin, Pau Rodríguez López, and Alexandre Lacoste. Tadam: Task dependent adaptive metric for improved fewshot learning. In Advances in Neural Information Processing Systems, pages 721–731, 2018.
 [30] Avinash Ravichandran, Rahul Bhotika, and Stefano Soatto. Fewshot learning with embedded class models and shotfree meta training. arXiv preprint arXiv:1905.04398, 2019.
 [31] Mengye Ren, Eleni Triantafillou, Sachin Ravi, Jake Snell, Kevin Swersky, Joshua B Tenenbaum, Hugo Larochelle, and Richard S Zemel. Metalearning for semisupervised fewshot classification. arXiv preprint arXiv:1803.00676, 2018.
 [32] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International journal of computer vision, 115(3):211–252, 2015.
 [33] Andrei A Rusu, Dushyant Rao, Jakub Sygnowski, Oriol Vinyals, Razvan Pascanu, Simon Osindero, and Raia Hadsell. Metalearning with latent embedding optimization. arXiv preprint arXiv:1807.05960, 2018.
 [34] Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80, 2008.
 [35] Wei Shen, Kai Zhao, Yilu Guo, and Alan L Yuille. Label distribution learning forests. In Advances in Neural Information Processing Systems, pages 834–843, 2017.
 [36] Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for fewshot learning. In Advances in Neural Information Processing Systems, pages 4077–4087, 2017.
 [37] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research, 15(1):1929–1958, 2014.
 [38] Chen Sun, Abhinav Shrivastava, Saurabh Singh, and Abhinav Gupta. Revisiting unreasonable effectiveness of data in deep learning era. In Proceedings of the IEEE international conference on computer vision, pages 843–852, 2017.

 [39] Qianru Sun, Yaoyao Liu, Tat-Seng Chua, and Bernt Schiele. Meta-transfer learning for few-shot learning. In CVPR, 2019.
 [40] Flood Sung, Yongxin Yang, Li Zhang, Tao Xiang, Philip HS Torr, and Timothy M Hospedales. Learning to compare: Relation network for few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1199–1208, 2018.
 [41] Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. In Advances in neural information processing systems, pages 3630–3638, 2016.
 [42] Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie. The caltechucsd birds2002011 dataset. 2011.
 [43] HanJia Ye, Hexiang Hu, DeChuan Zhan, and Fei Sha. Learning embedding adaptation for fewshot learning. arXiv preprint arXiv:1812.03664, 2018.
 [44] Chao Yin and Xin Geng. Facial age estimation by conditional probability neural network. In Chinese Conference on Pattern Recognition, pages 243–250. Springer, 2012.
 [45] Sung Whan Yoon, Jun Seo, and Jaekyun Moon. Tapnet: Neural network augmented with taskadaptive projection for fewshot learning. arXiv preprint arXiv:1905.06549, 2019.
 [46] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
 [47] Ruixiang Zhang, Tong Che, Zoubin Ghahramani, Yoshua Bengio, and Yangqiu Song. Metagan: An adversarial approach to fewshot learning. In Advances in Neural Information Processing Systems, pages 2365–2374, 2018.