Deep artificial agents have achieved impressive performance on various computer vision tasks such as classification (Szegedy et al., 2016; He et al., 2016; Huang et al., 2017), object detection (He et al., 2017; Ren et al., 2015; Lin et al., 2017) and semantic segmentation (Chen et al., 2018; Zhang et al., 2018; Huang et al., 2019). For classification, deep recognizers have outperformed humans on visual recognition challenges for years (He et al., 2015). However, this success hinges on the ability to apply gradient-based optimization routines to high-capacity models given large-scale training samples as supervision (Santoro et al., 2016). When training samples are scarce, learning to recognize new concepts becomes challenging because of overfitting. In contrast, humans can learn new concepts quickly from only a few examples. This gap motivates the few-shot visual recognition task.
Few-shot visual recognition, also termed few-shot learning, aims to learn novel visual concepts from one or a few labelled instances. It was first introduced by Fei-Fei et al. (Fe-Fei et al., 2003) and has attracted considerable attention ever since. Data augmentation is a straightforward method to tackle the task (Hariharan and Girshick, 2017; Zhang et al., 2019), but it does not solve the low-data problem. A promising trend is to transfer knowledge from known categories (i.e. seen categories with abundant training examples) to new categories (i.e. novel categories with few examples) (Fe-Fei et al., 2003; Rodner and Denzler, 2010; Yu and Aloimonos, 2010), simulating the human learning process (Gentner and Holyoak, 1997; Zhou et al., 2019). To this end, a variety of approaches have been proposed to learn transferable knowledge in various forms, among which metric-learning methods, which concentrate on learning transferable feature embeddings, form one of the main strands.
Given an unlabeled query sample and a limited number of labelled support samples, a metric-learning method first maps all the input samples into a latent space in which similarities between the query sample and each support sample are estimated according to a predefined distance metric, e.g. the cosine distance metric (Vinyals et al., 2016) or the Euclidean distance metric (Snell et al., 2017). It then assigns the query sample the label of the support sample that gets the highest similarity score (Vinyals et al., 2016; Cai et al., 2018; Li et al., 2019a). While promising, most metric-learning methods consider only instant pairwise query-support relationships, as shown in Figure 1(b), and fail to explore support-support relationships among the labelled support samples, let alone those among the unlabeled ones. Moreover, the single-image-estimated features they use for few-shot learning (i.e. image representations estimated independently from a single image) are normally not discriminative enough to bring a considerable performance improvement. We also question whether the selected metric suits few-shot learning best: a predefined metric may drive the feature extractor to generate a latent space that matches this very metric perfectly but is sub-optimal for few-shot learning.
To address the above problems, we extend (Sung et al., 2018) to explicitly explore the relationships between every two samples and propose to propagate information from relevant samples for embedding enhancement. In particular, we adopt a parameterized relation module to estimate the relationship between two samples. The module takes two samples as input and tells to what extent they belong to the same category. It is trained to produce close relationships for similar samples and distant relationships for dissimilar samples. With this module, we aim to learn a generic distance metric simultaneously with searching for the optimal latent space for few-shot learning. A weighted fully connected relation graph, as illustrated in Figure 1(a), is then constructed by applying the learnt metric for pairwise comparison in the working context. We propose to explore this relation graph for embedding enhancement to facilitate few-shot learning. To enhance the representation of an instance in the graph, we retrieve its neighbors and perform weighted information propagation to attentively aggregate visual information from them. Such information propagation reduces noise in class representations and expands decision boundaries, effectively serving as a manifold smoothing regularization. Additionally, we add a memory to serve as the working context, which controls the number of samples that are available in the graph construction and information propagation. Because the memory has a flexible capacity, it can easily be extended to hold unlabeled query samples or unlabeled auxiliary samples, so the method generalizes well to the transductive setting or the semi-supervised setting.
We dub the proposed method Memory-Augmented Relation Network (MRN). Our main contributions are as follows:
A generic distance metric is learned for few-shot learning. By designing the distance metric as a parameterized relation module, we learn transferable representations and distance metric end to end.
We propose to enhance representations via attentively aggregating information from the neighborhood context. The enhanced representations are shown to be more discriminative.
We propose a general memory-based framework for few-shot learning. It comes with a tightly coupled structure in which the propagation procedure and the classifier share the same learnable distance metric.
2. Related work
Approaches that aim to tackle few-shot learning can be roughly divided into three groups: meta-learning approaches, metric-learning approaches, and hallucination based approaches. Recently, transductive learning has drawn much attention because of the performance boost. In this section, we briefly review relevant work for each of the groups and point out their relevance to our method.
Meta-learning approaches Meta-learning approaches usually utilize optimization-based meta-learning or the learning-to-learn paradigm (Hochreiter et al., 2001; Andrychowicz et al., 2016) to train a meta-learner for few-shot learning. The meta-learner either learns an optimizer for training another linear classifier (Ravi and Larochelle, 2017) or learns an optimal initialization state for the base classifier (Finn et al., 2017; Rusu et al., 2019; Lee et al., 2019), from which fast learning is feasible with few examples. The meta-learner is usually difficult to train due to its complicated memory architectures (e.g. RNNs, recurrent neural networks) and the temporally-linear hidden state dependency (Mishra et al., 2018). Besides, additional fine-tuning on the target few-shot learning task is required. In contrast to those methods, the proposed MRN can be easily trained end to end from scratch without introducing any complicated memory architecture, and can adapt to new tasks directly with no need for fine-tuning.
Metric-learning approaches Metric-learning approaches mainly aim at learning transferable representations for few-shot learning (Vinyals et al., 2016; Snell et al., 2017; Sung et al., 2018; Koch et al., 2015; Wu et al., 2019; Li et al., 2019b; Cai et al., 2018). To this end, Vinyals et al. (Vinyals et al., 2016) presented matching networks, which use an attention mechanism to derive a weighted k-NN classifier, and devised an episodic meta-learning mechanism to train the classifier for fast adaptation. Snell et al. (Snell et al., 2017) proposed to take the mean representation of support samples in each class as the prototype and recognize a query sample according to its squared Euclidean distances to the prototypes. Li et al. (Li et al., 2019b) advanced this image-to-category measure by replacing the image-level representation with a set of local descriptors. Wu et al. (Wu et al., 2019) also exploited local descriptors, using a dual correlation attention mechanism to deal with inherent local connectivity. They proposed to concatenate four globally related features derived from cross-correlation attention and self-correlation attention to yield a more representative feature. Different from (Wu et al., 2019), which focuses on extracting a more representative representation from one single image, in this work we propose to enhance the representation of a sample by aggregating information from its neighborhood, i.e. the other samples. Both (Wu et al., 2019) and our MRN are descendants of the relation network (Sung et al., 2018), a deep model which extends the siamese network (Koch et al., 2015) by adopting the episodic meta-training mechanism proposed in (Vinyals et al., 2016).
Hallucination based approaches For a recognizer that has a feature extractor and a classifier as two distinct components, hallucination based approaches propose to improve performance from two aspects: data augmentation (Hariharan and Girshick, 2017; Wang et al., 2018; Zhang et al., 2019) and classifier weight vector hallucination (Qi et al., 2018; Qiao et al., 2018; Zhou et al., 2019; Gidaris and Komodakis, 2018, 2019). The data augmentation approaches commonly focus on expanding intra-class variations by hallucinating new training samples. The classifier weight vector hallucination approaches, on the other hand, propose to directly set the weight vector for a novel category in the classifier. Such approaches are inspired by the close relationships between classifier weight vectors and the feature representations associated with the same category produced by the feature extractor. Though straightforward, the imprinting process provides a surprisingly good instant classification performance on novel categories and an initialization for future fine-tuning.
Transductive learning In standard few-shot learning, each query sample is recognized independently, one by one, with only a few labelled support samples. Transductive few-shot learning proposes to feed all the query samples at once and predict them as a whole. In this setting, the relationships among all samples, both labelled and unlabeled, can be considered to improve performance. For example, Liu et al. (Liu et al., 2019) presented a model that utilizes the entire query set for transductive inference, which propagates labels from labelled samples to unlabeled ones based on visual affinities. Other representatives are (Ye et al., 2020; Li et al., 2019a; Kim et al., 2019). In (Ye et al., 2020), Ye et al. trained an attention-based transformer to transform task-agnostic embeddings into task-specific embeddings for few-shot learning. The proposed FEAT selects relevant instances and combines their transformed embeddings to obtain a new task-specific embedding. Li et al. (Li et al., 2019a) followed the feature propagation idea and proposed to aggregate information from the neighborhood in a tree graph. Likewise, we propose to aggregate information from the neighborhood context for representation enhancement. The key difference between (Li et al., 2019a) and our MRN is that we extend (Sung et al., 2018) to jointly learn a generic distance metric and update representations with information from samples selected by this very distance metric.
A few-shot learning task $\mathcal{T}$ is generally composed of two sets of instances: the support set $\mathcal{S}$, which contains labelled instances, and the query set $\mathcal{Q}$, which contains unlabeled instances. For simplicity, we consider the well-organized $K$-shot $N$-way few-shot setting, where the support set $\mathcal{S}$ is prepared by sampling $K$ instances from each of the $N$ categories. Instances in the query set $\mathcal{Q}$ are sampled from these $N$ categories as well but must share no individual instance with $\mathcal{S}$, i.e. $\mathcal{S} \cap \mathcal{Q} = \emptyset$. Few-shot learning aims to classify each unlabeled instance in $\mathcal{Q}$ with the labelled instances in $\mathcal{S}$ as supervision.
Intuitively, one can train a classifier on the support set $\mathcal{S}$ in the hope that it generalizes well to the query set $\mathcal{Q}$. Unfortunately, this straightforward proposal suffers from severe overfitting due to the data scarcity problem. The community usually resorts to an auxiliary meta-train set $\mathcal{D}_{\text{train}}$ to learn transferable knowledge to improve generalization on the target task $\mathcal{T}$. The set $\mathcal{D}_{\text{train}}$ contains abundant labelled examples from a set of categories $\mathcal{C}_{\text{train}}$ and has a disjoint label space w.r.t. the categories $\mathcal{C}_{\mathcal{T}}$ of the target few-shot learning task, i.e. $\mathcal{C}_{\text{train}} \cap \mathcal{C}_{\mathcal{T}} = \emptyset$. An effective way to exploit this meta-train set is to perform episodic meta-training with a large number of sampled training episodes, as proposed in (Vinyals et al., 2016).
To set up a training episode $\mathcal{E}$, we first randomly sample $N$ categories from the meta-train set $\mathcal{D}_{\text{train}}$. Then, for each of the $N$ categories, we randomly sample $K$ labelled examples to serve as the support set $\mathcal{S}_{\mathcal{E}}$. A fraction of the remaining examples of the $N$ categories are selected to act as the query set $\mathcal{Q}_{\mathcal{E}}$. The training episode $\mathcal{E}$ thus mimics the target $K$-shot $N$-way few-shot learning task $\mathcal{T}$. Training a model over tens of thousands of such training episodes with the objective defined in Equation (1) would yield one model that generalizes well when applied to tasks that contain novel categories.
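The episode construction described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `sample_episode`, the dictionary layout of the toy meta-train set, and the fixed number of query examples per category are all assumptions made for the example.

```python
import random

def sample_episode(dataset, n_way, k_shot, n_query, rng=random):
    """Sample one K-shot N-way episode from a {category: [examples]} mapping."""
    categories = rng.sample(sorted(dataset), n_way)
    support, query = [], []
    for label in categories:
        # Draw support and query examples together so they never overlap.
        picked = rng.sample(dataset[label], k_shot + n_query)
        support += [(x, label) for x in picked[:k_shot]]
        query += [(x, label) for x in picked[k_shot:]]
    return support, query

# Hypothetical toy meta-train set: 10 categories with 30 examples each.
toy = {f"class_{c}": [f"img_{c}_{i}" for i in range(30)] for c in range(10)}
support, query = sample_episode(toy, n_way=5, k_shot=1, n_query=15,
                                rng=random.Random(0))
print(len(support), len(query))
```

Sampling support and query examples in a single draw per category is one simple way to guarantee that the two sets share no instance.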
In Equation (1), $\theta$ are the parameters of the model, $P_{\theta}(y \mid x, \mathcal{S}_{\mathcal{E}})$ denotes the probability of sample $x$ being from category $y$, and $R(\theta)$ is the standard regularization. Following (Vinyals et al., 2016; Snell et al., 2017; Sung et al., 2018; Liu et al., 2019), we adopt the episodic meta-training strategy in this work.
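For reference, one plausible form of the episodic training objective that matches the description of Equation (1) is given below. The notation ($\theta$ for the model parameters, $\mathcal{E}$ for an episode with support set $\mathcal{S}_{\mathcal{E}}$ and query set $\mathcal{Q}_{\mathcal{E}}$, $P_{\theta}$ for the predicted class probability, $R$ for the regularizer) is assumed for illustration and may differ from the exact formulation used in this work.

```latex
\theta^{*} = \arg\min_{\theta} \;
  \mathbb{E}_{\mathcal{E}} \Big[ \sum_{(x, y) \in \mathcal{Q}_{\mathcal{E}}}
  -\log P_{\theta}\big(y \mid x, \mathcal{S}_{\mathcal{E}}\big) \Big] + R(\theta)
```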
We present a memory-augmented relation network (MRN), whose pipeline is illustrated in Figure 2, to solve the few-shot learning problem. It consists of four components: a feature embedding function for mapping similar images to nearby points in the latent space, an information aggregation component together with an episodic memory for representation enhancement, a relation module that learns to tell whether two images are from the same category, and a metric-based classifier for making the final predictions. In the following sections, we explain the information aggregation component and the metric-based classifier in detail and give a brief description of the relation module. The embedding function is left untouched because MRN does not depend on a particular backbone.
4.1. The Metric-based Classifier
We first introduce our parameter-free metric-based classifier for clarity. The classifier, in short, takes an unlabeled query image and a set of labelled support images as input, and applies an image-to-category measure to assign the query image to the category that it most likely belongs to. Let $\{f_i\}$ be the feature embeddings of instances in a given $K$-shot $N$-way training episode $\mathcal{E}$; the classifier imposes the cross entropy loss on each image for network training:
where $\mu_c$ represents the $c$-th class centroid that is computed as the mean value of all support images in the category as defined in (Snell et al., 2017), $y_{i,c}$ is the index label with $y_{i,c} = 1$ if the image belongs to the $c$-th category and $y_{i,c} = 0$ otherwise, and $d(\cdot, \cdot)$ denotes a certain distance metric. Likewise, let $\{f_i\}$ be the feature embeddings of instances in the target $K$-shot $N$-way few-shot learning task $\mathcal{T}$; in the test phase, the classifier labels each query image according to its distance to the class centroids as defined in Equation 3.
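A minimal sketch of such a centroid-based classifier follows. It assumes that class probabilities are a softmax over negative distances to the centroids, as in prototypical networks, and it uses the squared Euclidean distance as a stand-in for the metric; in MRN the metric is the learnt relation module. All function names are hypothetical.

```python
import math

def centroid(vectors):
    """Mean of a list of equal-length feature vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def sq_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def class_probabilities(query, centroids, dist=sq_euclidean):
    """Softmax over negative distances to each class centroid (assumed form)."""
    logits = [-dist(query, c) for c in centroids]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(query, label, centroids, dist=sq_euclidean):
    """Training loss for one query image with a known class index."""
    return -math.log(class_probabilities(query, centroids, dist)[label])

def predict(query, centroids, dist=sq_euclidean):
    """Test phase: assign the query to the class with the nearest centroid."""
    return min(range(len(centroids)), key=lambda c: dist(query, centroids[c]))

# Toy 2-shot 2-way support set with 2-dimensional embeddings.
support = {0: [[0.0, 0.0], [0.0, 2.0]], 1: [[4.0, 4.0], [6.0, 4.0]]}
centroids = [centroid(support[c]) for c in sorted(support)]
print(predict([1.0, 1.0], centroids))
```

Passing the metric as the `dist` argument keeps the classifier itself parameter-free, so any learnt distance can be plugged in without changing the classification rule.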
4.2. Embedding Enhancement within Episodic Memory
The metric-based classifier relies heavily on discriminative representations to predict correct labels. However, in few-shot learning, the embedding function is usually not sufficiently trained to model representative semantics in the training data. Feeding these raw features into a metric-based classifier leads to inferior performance. To alleviate the problem, we propose to enhance a sample by adding to its representation extra information propagated from other similar or relevant samples. Specifically, when working with episodic meta-training, for each image $x_i$ in a training episode $\mathcal{E}$, we select its $k$ nearest neighbors from the available samples and update its feature embedding by gathering information from these neighbors as in Equation 4:
where $f_i$ denotes the representation of the instance $x_i$, $w_{ij}$ is the coefficient that controls information flow during the aggregation as defined in Equation 5, $d(\cdot, \cdot)$ represents some distance metric, and the hyperparameter $\lambda$ indicates the proportion of information preserved to avoid disturbance from the neighborhood.
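The aggregation step can be illustrated with the following sketch. It assumes that the enhanced embedding is a convex combination of the original embedding (weight `lam`) and a weighted sum over the `k` nearest neighbors (weight `1 - lam`), and that the coefficients are a softmax over negative distances. The softmax form, the squared Euclidean stand-in metric, and the function name `enhance` are assumptions, not details confirmed by the text.

```python
import math

def sq_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def enhance(index, features, k, lam, dist=sq_euclidean):
    """Return the enhanced embedding of features[index] (requires k >= 1)."""
    target = features[index]
    # Rank all other samples by distance and keep the k nearest.
    others = [j for j in range(len(features)) if j != index]
    neighbors = sorted(others, key=lambda j: dist(target, features[j]))[:k]
    # Assumed form of the coefficients: softmax over negative distances.
    scores = [-dist(target, features[j]) for j in neighbors]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    weights = [e / sum(exps) for e in exps]
    # Weighted sum of the neighbor embeddings.
    aggregated = [sum(w * features[j][d] for w, j in zip(weights, neighbors))
                  for d in range(len(target))]
    # Keep a fraction lam of the original embedding; take the rest from neighbors.
    return [lam * t + (1.0 - lam) * a for t, a in zip(target, aggregated)]

features = [[0.0, 0.0], [0.2, 0.0], [0.0, 0.2], [5.0, 5.0]]
enhanced = enhance(0, features, k=2, lam=0.2)
print(enhanced)
```

In this toy example the distant sample at `[5.0, 5.0]` is excluded by the neighbor selection, so the enhanced embedding moves only toward the two nearby samples.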
The information propagation can be naturally simulated by message passing in a graph. We therefore organize all instances as a weighted relation graph $\mathcal{G}$ as defined in Equation 6:
where the edge $e_{ij}$ reflects the visual affinity between samples $x_i$ and $x_j$. For each node $v_i$, we select its $k$-nearest neighbors based on edge weights and perform the aforementioned weighted information aggregation to update its node embedding, which accordingly updates the image representation because the two are identical. Theoretically, the information aggregation can be performed iteratively:
where $t$ denotes the current search depth. By increasing $t$, we can aggregate information from more samples, such as the neighbors of neighbors. Consequently, the deeper we go into the graph, the denser the updated embeddings are. However, a large $t$ hurts the classification performance. Imagine the situation where $t$ is large enough: all images collapse to the same point in the latent space, which makes classification impossible.
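The collapse caused by deep propagation can be seen in a simplified setting. The sketch below makes strong simplifying assumptions: embeddings are scalars, the graph is fully connected, and every node averages all other nodes with equal weight rather than using learnt, distance-based coefficients. The names `propagate` and `spread` are hypothetical.

```python
def propagate(values, lam, depth):
    """Apply `depth` rounds of aggregation on a fully connected graph.

    Simplifying assumption: each node averages all other nodes uniformly,
    keeping a fraction `lam` of its own value at every round.
    """
    n = len(values)
    for _ in range(depth):
        total = sum(values)
        values = [lam * v + (1.0 - lam) * (total - v) / (n - 1) for v in values]
    return values

def spread(values):
    """Gap between the largest and smallest embedding."""
    return max(values) - min(values)

start = [0.0, 1.0, 2.0, 7.0, 8.0, 9.0]
for depth in (0, 1, 2, 3):
    print(depth, spread(propagate(start, lam=0.5, depth=depth)))
```

Under these assumptions every round scales each node's deviation from the global mean by the constant factor `lam - (1 - lam) / (n - 1)`, so the embeddings contract geometrically toward a single point as the depth grows.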
Undoubtedly, the number of potentially exploitable instances in the working context matters in the propagation procedure. In this work, we keep an episodic memory to serve as the working context and consider a more controllable propagation procedure for embedding enhancement. The memory has an adjustable size so that it can hold a flexible number of instances. As depicted in Figure 2, we initialize the memory with features extracted by the embedding function, and update the memory whenever the information aggregation changes the feature stored in one memory slot. With the help of the memory cache, our embedding enhancement adapts very easily to transductive few-shot learning or semi-supervised few-shot learning by initializing the memory to hold the unlabeled samples that we intend to exploit during the weighted information propagation. Different from other work that adopts memory (Santoro et al., 2016; Cai et al., 2018), the memory in our work lives as long as an episode exists but does not survive across episodes. When a new episode arrives, the old memory is destroyed and a new memory initialized with new data is created.
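The lifecycle of the episodic memory can be sketched as a small container. The class `EpisodicMemory`, its methods, and the helper `run_episode` are hypothetical names introduced only to illustrate the behavior described above: one memory per episode, slots initialized with extracted features, slots overwritten after aggregation, and optional unlabeled samples for the transductive or semi-supervised setting.

```python
class EpisodicMemory:
    """Feature store that lives for exactly one episode."""

    def __init__(self, features, labels=None):
        # One slot per sample; None marks an unlabeled sample.
        self.slots = [list(f) for f in features]
        self.labels = list(labels) if labels is not None else [None] * len(features)

    def extend(self, features, labels=None):
        """Grow the memory, e.g. with unlabeled query or auxiliary samples."""
        self.slots += [list(f) for f in features]
        self.labels += list(labels) if labels is not None else [None] * len(features)

    def write(self, index, feature):
        """Overwrite a slot after information aggregation changed its feature."""
        self.slots[index] = list(feature)

    def __len__(self):
        return len(self.slots)

def run_episode(support_features, support_labels, query_features):
    # A fresh memory is created for each episode and discarded afterwards.
    memory = EpisodicMemory(support_features, support_labels)
    memory.extend(query_features)  # transductive setting: add unlabeled queries
    return memory

memory = run_episode([[0.0, 1.0], [1.0, 0.0]], [0, 1], [[0.5, 0.5]])
print(len(memory), memory.labels)
```

Because the memory is created inside `run_episode`, nothing is carried over between episodes, which mirrors the statement that the memory does not survive across episodes.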
4.3. Relation Module: Learning to Compare
We design the relation module as a simple parameterized CNN model, depicted in Figure 3. Like the Euclidean distance metric, it takes two input instances and estimates the distance between them. For example, let $f_i$ and $f_j$ be the feature embeddings of samples $x_i$ and $x_j$; the distance between them can be computed by Equation 8 as follows:
where $\mathcal{P}$ denotes the preprocessing before pairwise comparison. We use the simple difference operation as defined in Equation 9 for preprocessing:
As shown in Figure 2, the relation module is utilized as a shared submodule in the information aggregation procedure and the metric-based classifier for relation reasoning and classification respectively. That is:
We will show in our experiments that using the relation module in both the information propagation and the metric-based classifier boosts performance compared with using the plain Euclidean distance metric.
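As an illustration of how the pieces of this section fit together, the learnt distance and its preprocessing step might be written as below. Here $g_{\phi}$ stands for the parameterized CNN of the relation module with parameters $\phi$, and $\mathcal{P}$ for the preprocessing. This notation is assumed, and the signed element-wise difference is only one reading of "simple difference operation"; an absolute difference would fit the description equally well.

```latex
d(f_i, f_j) = g_{\phi}\big(\mathcal{P}(f_i, f_j)\big),
\qquad
\mathcal{P}(f_i, f_j) = f_i - f_j
```

Under this reading, the same $d(\cdot, \cdot)$ supplies the edge weights used during information aggregation and the query-to-centroid distances used by the metric-based classifier.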
In this section, we evaluate our MRN and compare it with state-of-the-art approaches on two few-shot learning benchmark datasets, i.e. miniImagenet (Snell et al., 2017) and tieredImagenet (Ren et al., 2018).
miniImagenet The miniImagenet dataset, originally introduced by Vinyals et al. (Vinyals et al., 2016), is derived from the larger ILSVRC-12 dataset (Russakovsky et al., 2015). It consists of 60,000 color images of size $84 \times 84$ that are divided into 100 categories with 600 examples each. We follow (Ravi and Larochelle, 2017; Snell et al., 2017) to split the dataset into 64 base categories for training, 16 novel categories for validation, and 20 novel categories for testing. In few-shot learning, the target model is trained on images from the 64 training categories and evaluated on the 20 novel categories. The validation set is used for monitoring generalization performance only.
tieredImagenet The tieredImagenet (Ren et al., 2018) is another subset of the ILSVRC-12 dataset (Russakovsky et al., 2015), but it has a much larger class cardinality (608 classes) than that of miniImagenet. The dataset contains more than 700,000 images in total. The average number of examples in each class is more than 1200. Each of its 608 classes is derived from 34 high-level categories in Imagenet. The 34 high-level categories are further divided into 20 training (351 classes), 6 validation (97 classes) and 8 test (160 classes) categories. This high-level split strategy provides a more challenging and realistic few-shot setting where the training classes are distinct from the test classes semantically.
5.2. Implementation Details
We use Conv-4-64 as the basic feature extractor in our model. The backbone contains 4 convolutional blocks. Each block has a 64-filter convolutional layer with kernel size 3, a batch normalization layer, a ReLU activation layer, and a max pooling layer. We utilize the relation module depicted in Figure 3 (as mentioned in Section 4.3) for relation reasoning. The number of filters in each block of the relation module is set to 64 in order to work with the feature extractor.
Episodic training We adopt the episodic meta-training strategy proposed in (Vinyals et al., 2016) for rapid learning in our $K$-shot $N$-way experiments. In each training episode, besides the support images, the 1-shot 5-way episode contains 15 query images for each of the sampled categories. The corresponding number in the 5-shot 5-way episode is 10. That is, one episode contains $1 \times 5 + 15 \times 5 = 80$ training images in total for 1-shot 5-way experiments. All images are resized to a fixed resolution. In the training phase, we also apply basic data augmentations such as random cropping and horizontal flipping to increase intra-class variations. The model is optimized with the Adam (Finn et al., 2017) optimizer end-to-end from scratch. The learning rate is initially set to 0.001, and weight decay is applied. For experiments on miniImagenet, we trained the model over 200,000 randomly sampled episodes and halved the learning rate every 50,000 episodes. For experiments on tieredImagenet, we trained the model over 500,000 randomly sampled training episodes and halved the learning rate every 100,000 episodes. All our experiments are implemented in PyTorch on an Nvidia GeForce GTX 1080 Ti GPU.
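The episode size and the step-wise learning-rate schedule stated above can be checked with a short calculation. The helper names are hypothetical, and whether the halving takes effect exactly at episode index 50,000 (counting from zero) is an assumption of this sketch.

```python
def episode_size(n_way, k_shot, queries_per_class):
    """Number of images in one episode: support plus query."""
    return n_way * k_shot + n_way * queries_per_class

def learning_rate(episode, base_lr=0.001, halve_every=50_000):
    """Step schedule: halve the learning rate after every `halve_every` episodes."""
    return base_lr * 0.5 ** (episode // halve_every)

print(episode_size(5, 1, 15))   # 1-shot 5-way
print(episode_size(5, 5, 10))   # 5-shot 5-way
print(learning_rate(0), learning_rate(50_000), learning_rate(199_999))
```

For the tieredImagenet schedule, the same function can be called with `halve_every=100_000`.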
Hyperparameters We estimated the hyperparameters by cross validation on miniImagenet. In all our experiments, we set the neighborhood discovery hyperparameter $k = 20$, the information aggregation hyperparameter (search depth) $t = 1$, and the information preservation hyperparameter $\lambda = 0.2$. Namely, we selected the 20 nearest neighbors of an example as its neighborhood, and only propagated information from its 1-order neighbors for embedding enhancement, even though information aggregation from a neighborhood of arbitrary depth is possible. After the aggregation procedure in Equation 4, an embedding consists of 20 percent old information and 80 percent new information aggregated from its neighborhood.
Evaluation We batched 15/10 query images per category in each episode for evaluation in 1-shot/5-shot 5-way classification. In all settings, we conducted few-shot classification on 1000 randomly sampled episodes from the test set and reported the mean accuracy together with the confidence interval.
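The reporting of mean accuracy with a confidence interval over test episodes can be sketched as follows. The confidence level is not stated above, so the 95% level (normal approximation, `z = 1.96`) used here is an assumption, as are the function name and the toy accuracy values.

```python
import math
import statistics

def mean_with_ci(accuracies, z=1.96):
    """Mean accuracy and half-width of a normal-approximation interval."""
    mean = statistics.fmean(accuracies)
    # Sample standard deviation divided by sqrt(n) gives the standard error.
    half_width = z * statistics.stdev(accuracies) / math.sqrt(len(accuracies))
    return mean, half_width

# Hypothetical per-episode accuracies; a real evaluation would use 1000 episodes.
episode_acc = [0.50, 0.60, 0.55, 0.65, 0.45, 0.70, 0.40, 0.55]
mean, half_width = mean_with_ci(episode_acc)
print(f"{100 * mean:.2f} +/- {100 * half_width:.2f}")
```

Since the half-width shrinks with the square root of the number of episodes, averaging over 1000 episodes yields a much tighter interval than this toy example.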
5.3. Baseline methods
We mainly compare with state-of-the-art methods that concentrate on representation learning or metric learning to demonstrate the effectiveness of our proposed MRN. Since we aim to extend Relation Network (RN) (Sung et al., 2018) to exploit pairwise relationships for embedding enhancement, RN and its variant PARN (Wu et al., 2019) serve as the direct baselines. We also take TPN (Liu et al., 2019), MNE (Li et al., 2019a), FEAT (Ye et al., 2020), GNN (Garcia and Bruna, 2018) and MM-Net (Cai et al., 2018) as baselines because our MRN shares some commonalities with them. In short, all of these methods propose to exploit the working context (the labelled instances, the unlabeled instances, or both) to boost few-shot learning in either a transductive or a non-transductive manner. Among them, the transductive methods, i.e. TPN (Liu et al., 2019), MNE (Li et al., 2019a) and FEAT (Ye et al., 2020), take advantage of both labelled and unlabeled instances and provide quite strong baselines.
We additionally provide MRN-Zero and MRN-Euclid as two baselines. MRN-Zero represents the variant that discards the working memory to avoid information propagation. With no memory, it loses the ability to aggregate information from other instances and inevitably falls back to the plain relation network. However, MRN-Zero is slightly different from RN in that its relation module fuses two embeddings via difference preprocessing before pairwise comparison, rather than the concatenation used in RN. MRN-Euclid shares a similar structure with our MRN except that it utilizes the Euclidean distance metric for relation estimation during the information aggregation procedure instead of the learnt one, which decouples the information propagation from distance metric learning. By introducing MRN-Euclid, we aim to demonstrate the superiority of the proposed MRN, which has a compact structure and a tightly coupled workflow.
5.4. Main Results
|5-way Acc (, miniImagenet)||5-way Acc (, tieredImagenet)|
|MAML (Finn et al., 2017)||ICML’ 17||Conv-4-64||N||Y|
|ProtoNet (Snell et al., 2017)||NeurIPS’ 17||Conv-4-64||N||N|
|ProtoNet(C+) (Snell et al., 2017)||NeurIPS’ 17||Conv-4-64||N||N||-||-|
|GNN (Garcia and Bruna, 2018)||ICLR’ 18||Conv-4-256||N||N||-||-|
|RN (Sung et al., 2018)||CVPR’ 18||Conv-4-64||N||N|
|MM-Net (Cai et al., 2018)||CVPR’ 18||Conv-4-64||N||N||-||-|
|DN4 (Li et al., 2019b)||CVPR’ 19||Conv-4-64||N||N||-||-|
|PARN (Wu et al., 2019)||ICCV’ 19||Conv-4-64||N||N||-||-|
|TPN (Liu et al., 2019)||ICLR’ 19||Conv-4-64||Y||N|
|TPN(K+) (Liu et al., 2019)||ICLR’ 19||Conv-4-64||Y||N|
|MNE (Li et al., 2019a)||ICCV’ 19||Conv-4-64||Y||N|
|FEAT (Ye et al., 2020)||CVPR’ 20||Conv-4-64||N||N||-||-|
|FEAT (Ye et al., 2020)||CVPR’ 20||Conv-4-64||Y||N||-||-|
|Hallucination based approaches|
|ActivationNet (Qiao et al., 2018)||CVPR’ 18||Conv-4-64||N||N||-||-|
We validate the effectiveness of the proposed MRN on standard miniImagenet and tieredImagenet datasets, and report the main results in Table 1.
Comparison to previous state-of-the-arts Table 1 reports the classification accuracy of MRN for 1-shot/5-shot 5-way tasks on miniImagenet and tieredImagenet. MRN outperforms non-transductive methods, such as ProtoNet, RN, GNN, DN4 and PARN, by a large margin on both datasets, which confirms that exploring the working context by explicitly aggregating visual information from unlabeled instances benefits few-shot learning considerably and supports the effectiveness of MRN. When compared to transductive methods, MRN largely outperforms TPN and its high-shot variant in both 1-shot and 5-shot learning on the two datasets, and compares favorably with FEAT in the 1-shot setting. MNE provides a strong baseline that is better than other previous work. In comparison, the performance of MRN is inferior to that of MNE on miniImagenet. However, MRN improves over MNE in the 1-shot/5-shot settings on tieredImagenet and achieves state-of-the-art performance. It should be noted that MRN tends to achieve larger improvements or better results on tieredImagenet than on miniImagenet. We conjecture that this is primarily because the training split of tieredImagenet contains more categories and more training examples, which presents richer intra-class variations that facilitate model training. We also observe that MRN consistently makes larger gains for 1-shot learning than for its 5-shot counterpart, which indicates that the embedding enhancement proposed in Section 4.2 is more helpful when training data is extremely scarce.
Comparison among MRN variants Among the three MRN variants, MRN-Zero does not propagate information from other instances for embedding enhancement and consequently achieves inferior performance when compared to previous transductive methods. This is in line with expectation because transductive methods consistently outperform non-transductive methods in few-shot learning. When compared to RN, MRN-Zero gets competitive results on miniImagenet and better results on tieredImagenet, which demonstrates the effectiveness of the difference preprocessing used in our relation module as described in Section 4.3. MRN-Euclid outperforms MRN-Zero by a large margin on miniImagenet and tieredImagenet for 1-shot learning, but surprisingly falls behind MRN-Zero on both datasets for 5-shot learning. We attribute this mostly to its decoupled workflow, in which the Euclidean distance is adopted for relation estimation during embedding enhancement, decoupling the embedding enhancement from distance metric learning. In contrast, MRN consistently gets improvements, demonstrating the superiority of our compact model.
Visualization We visualize the similarity matrices learned by RN and our MRN under the 5-shot 5-way setting on miniImagenet. In detail, we randomly sampled a 5-shot 5-way task from the test split. For each category in the task, we select 10 query instances as in the training phase. Then all instances are fed into the target models to generate the matrices. Given the feature representations extracted by the backbone, RN computes its similarity matrix based on the pairwise similarities estimated by the learnt distance metric. Our MRN performs weighted information propagation as described in Section 4.2 to enhance representations right before it estimates the similarities. We visualize the matrices in Figure 4. It can be seen that MRN generates a matrix that is much closer to the ground truth. We also sampled a 1-shot 5-way task with 59 query images per category, extracted the features and visualized them using t-SNE in Figure 5. The first observation from the figure is that the backbone, i.e. the embedding function, can effectively learn to map similar images to nearby points in the embedding space (see Figure 5(a)). From Figure 5(c), we can observe that the propagation procedure dramatically decreases the intra-class variations and increases the inter-class variations. The embeddings after explicitly aggregating information from other instances in the working context are shown to be more discriminative.
5.5. Ablation study
To understand MRN better, we carried out several controlled experiments to examine how each part affects the final performance.
Sensitivities to hyperparameters The number of neighbors, i.e. the neighborhood discovery hyperparameter , is vital for information propagation. In Figure 6(a), we investigate the setting of . It can be observed that MRN with always surpass MRN-Zero (the baseline model or the plain relation network with ). We conclude that embedding enhancement by aggregating information from the neighborhood helps. But, we also notice that the gain drops when too few () or too many () neighbors are exploited. This is mainly because a tiny neighborhood can only provide limited amount of information and a huge neighborhood imposes irrelevant information which makes it difficult for our MRN to further boost performance.
In Figure 6(b), we compare different values of , the information aggregation hyperparameter used in Equation 7 that controls how deep into the relation graph we go. By increasing , we iteratively collect more information from less relevant instances. From the results, we can see that the performance tends to decrease as the search depth increases. For 5-shot classification, the performance of MRN with is even worse than that of the plain relation network MRN-Zero with , though it still achieves superior performance for 1-shot classification.
Impact of information preservation In Equation 4, we set a hyperparameter for information preservation. A smaller value means more information is aggregated from the neighborhood, and correspondingly a bigger value means more information is kept from the instance itself. To see how the performance changes with this hyperparameter, we conducted sensitivity experiments, shown in Figure 6(c). As the results show, the performance drops slightly in the 5-shot setting and drastically in the 1-shot setting as less and less information is aggregated from the neighborhood. However, our MRN performs neck and neck with MRN-Zero even when nearly no information is aggregated. Quite unexpectedly, we find that MRN outperforms MRN-Zero when only one percent of the original information is preserved, with 99 percent of the information aggregated from the neighbors. This indicates that the proposed MRN is effective enough to discriminate relevant examples from irrelevant ones and to borrow information from the relevant examples to form more discriminative embeddings.
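The roles of the three hyperparameters studied in this subsection can be illustrated with a minimal NumPy sketch. It is not the implementation used in our experiments: the function names, the cosine similarity standing in for the learnt metric, the softmax weighting over neighbors, the convex combination controlled by `alpha`, and the reading of the search depth as repeated propagation steps are all assumptions made for illustration. Here `k` is the number of neighbors, `depth` the number of propagation steps, and `alpha` the fraction of an instance's own embedding that is preserved.

```python
import numpy as np

def cosine_similarity(h):
    """Stand-in for the learnt metric (assumption): pairwise cosine similarity."""
    unit = h / np.linalg.norm(h, axis=1, keepdims=True)
    return unit @ unit.T

def propagate(features, k, depth=1, alpha=0.5, similarity_fn=cosine_similarity):
    """Enhance (n, d) embeddings by aggregating from each instance's k nearest neighbors.

    k     : neighborhood size; k = 0 leaves the embeddings untouched (baseline-like).
    depth : number of propagation steps (assumed reading of the search depth).
    alpha : share of the instance's own embedding that is preserved.
    """
    h = np.asarray(features, dtype=float)
    k = min(k, h.shape[0] - 1)               # an instance has at most n - 1 neighbors
    if k <= 0:
        return h.copy()
    for _ in range(depth):
        sim = similarity_fn(h)
        np.fill_diagonal(sim, -np.inf)       # an instance is not its own neighbor
        idx = np.argsort(-sim, axis=1)[:, :k]                 # (n, k) neighbor indices
        neigh_sim = np.take_along_axis(sim, idx, axis=1)      # (n, k) neighbor scores
        w = np.exp(neigh_sim - neigh_sim.max(axis=1, keepdims=True))
        w /= w.sum(axis=1, keepdims=True)                     # softmax over the neighbors
        aggregated = np.einsum("nk,nkd->nd", w, h[idx])       # weighted neighbor average
        h = alpha * h + (1.0 - alpha) * aggregated
    return h

# Toy example: two instances of one class and one outlier.
toy = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
enhanced = propagate(toy, k=1, depth=1, alpha=0.01)
# With alpha = 0.01 the outlier keeps 1% of itself and takes 99% from its
# neighbor, so its embedding becomes approximately [0.99, 0.01].
```

In this sketch, setting `k=0` or `alpha=1.0` returns the input embeddings unchanged, which mirrors the role MRN-Zero plays as the reference point in Figure 6.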
[Table 2: 5-way Acc]
Effectiveness of weighted embedding aggregation General aggregation strategies such as mean-pooling and max-pooling can also be used in the information aggregation procedure. In Table 2, we compare the proposed weighted embedding aggregation with mean-pooling and max-pooling feature aggregation. MRN-mean is the model that employs the mean-pooling strategy during information aggregation, and MRN-max denotes the one that uses the max-pooling strategy. The performance drops drastically for both MRN-mean and MRN-max. For MRN-mean, aggregating information equally from all neighbors hurts performance because the aggregation introduces too much irrelevant information. For MRN-max, the performance drop results from information loss, since only the max values are selected and merged into the target representations.
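The difference between the three strategies can be seen on a toy neighborhood. The sketch below is illustrative only: the function `aggregate`, the softmax weighting used for the weighted variant, and the toy scores are assumptions, not the exact formulation evaluated in Table 2.

```python
import numpy as np

def aggregate(neighbors, scores=None, mode="weighted"):
    """Merge (k, d) neighbor embeddings into a single (d,) vector.

    scores: (k,) similarity of each neighbor to the target instance,
            only used by the weighted variant.
    """
    neighbors = np.asarray(neighbors, dtype=float)
    if mode == "mean":
        return neighbors.mean(axis=0)        # every neighbor counts equally
    if mode == "max":
        return neighbors.max(axis=0)         # keeps only the per-dimension maxima
    if mode == "weighted":
        s = np.asarray(scores, dtype=float)
        w = np.exp(s - s.max())
        w /= w.sum()                         # softmax weights (assumed form)
        return w @ neighbors
    raise ValueError(f"unknown mode: {mode}")

# Two relevant neighbors and one irrelevant neighbor with a low similarity score.
neighbors = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
scores = np.array([5.0, 5.0, -5.0])

mean_vec = aggregate(neighbors, mode="mean")              # [2/3, 1/3]
max_vec = aggregate(neighbors, mode="max")                # [1, 1]
weighted_vec = aggregate(neighbors, scores, "weighted")   # close to [1, 0]
```

In this toy case mean-pooling lets the irrelevant neighbor contribute a third of the result, max-pooling discards which neighbor each value came from, and the weighted variant nearly ignores the irrelevant neighbor.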
In this work, we tackled few-shot learning problems from the perspective of metric learning. Unlike previous work that adopts a predefined metric, such as Euclidean distance or cosine distance, in the metric-based classifier for similarity measurement, we learn a generic distance metric to compare images for the same purpose. We further propose to propagate information from relevant instances in the working context, based on their visual affinities, for embedding enhancement. We introduce the episodic memory to serve as the working context, in which the pairwise relationships between every two instances are estimated with the learnt metric and the representation of an instance is enhanced by attentively aggregating information from its neighborhood. We empirically demonstrate that the learnt distance metric suits few-shot learning better than the predefined ones and that the enhanced embeddings are more discriminative, which in turn boosts performance. The proposed MRN with this tightly coupled workflow achieved state-of-the-art results on the benchmark datasets. For future work, we plan to extend MRN to deeper models. We also note that relation propagation is a promising direction and will study it as a supplement to our visual information propagation in the near future.
- Andrychowicz et al. (2016) Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman, David Pfau, Tom Schaul, Brendan Shillingford, and Nando De Freitas. 2016. Learning to learn by gradient descent by gradient descent. In Advances in neural information processing systems. 3981–3989.
- Cai et al. (2018) Qi Cai, Yingwei Pan, Ting Yao, Chenggang Yan, and Tao Mei. 2018. Memory Matching Networks for One-Shot Image Recognition. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018. IEEE Computer Society, 4080–4088. https://doi.org/10.1109/CVPR.2018.00429
- Chen et al. (2018) Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. 2018. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European conference on computer vision (ECCV). 801–818.
- Fe-Fei et al. (2003) Li Fe-Fei et al. 2003. A Bayesian approach to unsupervised one-shot learning of object categories. In Proceedings Ninth IEEE International Conference on Computer Vision. IEEE, 1134–1141.
- Finn et al. (2017) Chelsea Finn, Pieter Abbeel, and Sergey Levine. 2017. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70. JMLR.org, 1126–1135.
- Garcia and Bruna (2018) Victor Garcia and Joan Bruna. 2018. Few-Shot Learning with Graph Neural Networks. In International Conference on Learning Representations.
- Gentner and Holyoak (1997) Dedre Gentner and Keith J Holyoak. 1997. Reasoning and learning by analogy: Introduction. American psychologist 52, 1 (1997), 32.
- Gidaris and Komodakis (2018) Spyros Gidaris and Nikos Komodakis. 2018. Dynamic few-shot visual learning without forgetting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4367–4375.
- Gidaris and Komodakis (2019) Spyros Gidaris and Nikos Komodakis. 2019. Generating Classification Weights With GNN Denoising Autoencoders for Few-Shot Learning. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019. Computer Vision Foundation / IEEE, 21–30. https://doi.org/10.1109/CVPR.2019.00011
- Hariharan and Girshick (2017) Bharath Hariharan and Ross Girshick. 2017. Low-shot visual recognition by shrinking and hallucinating features. In Proceedings of the IEEE International Conference on Computer Vision. 3018–3027.
- He et al. (2017) Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. 2017. Mask r-cnn. In Proceedings of the IEEE international conference on computer vision. 2961–2969.
- He et al. (2015) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision. 1026–1034.
- He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. 770–778.
- Hochreiter et al. (2001) Sepp Hochreiter, A Steven Younger, and Peter R Conwell. 2001. Learning to learn using gradient descent. In International Conference on Artificial Neural Networks. Springer, 87–94.
- Huang et al. (2017) Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. 2017. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition. 4700–4708.
- Huang et al. (2019) Zilong Huang, Xinggang Wang, Lichao Huang, Chang Huang, Yunchao Wei, and Wenyu Liu. 2019. Ccnet: Criss-cross attention for semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision. 603–612.
- Kim et al. (2019) Jongmin Kim, Taesup Kim, Sungwoong Kim, and Chang D Yoo. 2019. Edge-Labeling Graph Neural Network for Few-shot Learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 11–20.
- Koch et al. (2015) Gregory Koch, Richard Zemel, and Ruslan Salakhutdinov. 2015. Siamese neural networks for one-shot image recognition. In ICML deep learning workshop, Vol. 2.
- Lee et al. (2019) Kwonjoon Lee, Subhransu Maji, Avinash Ravichandran, and Stefano Soatto. 2019. Meta-Learning With Differentiable Convex Optimization. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019. Computer Vision Foundation / IEEE, 10657–10665. https://doi.org/10.1109/CVPR.2019.01091
- Li et al. (2019a) Suichan Li, Dapeng Chen, Bin Liu, Nenghai Yu, and Rui Zhao. 2019a. Memory-Based Neighbourhood Embedding for Visual Recognition. In Proceedings of the IEEE International Conference on Computer Vision. 6102–6111.
- Li et al. (2019b) Wenbin Li, Lei Wang, Jinglin Xu, Jing Huo, Yang Gao, and Jiebo Luo. 2019b. Revisiting Local Descriptor based Image-to-Class Measure for Few-shot Learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 7260–7268.
- Lin et al. (2017) Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. 2017. Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition. 2117–2125.
- Liu et al. (2019) Yanbin Liu, Juho Lee, Minseop Park, Saehoon Kim, Eunho Yang, Sung Ju Hwang, and Yi Yang. 2019. Learning to Propagate Labels: Transductive Propagation Network for Few-Shot Learning. In International Conference on Learning Representations.
- Mishra et al. (2018) Nikhil Mishra, Mostafa Rohaninejad, Xi Chen, and Pieter Abbeel. 2018. A simple neural attentive meta-learner. In International Conference on Learning Representations.
- Qi et al. (2018) Hang Qi, Matthew Brown, and David G Lowe. 2018. Low-shot learning with imprinted weights. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5822–5830.
- Qiao et al. (2018) Siyuan Qiao, Chenxi Liu, Wei Shen, and Alan L Yuille. 2018. Few-shot image recognition by predicting parameters from activations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 7229–7238.
- Ravi and Larochelle (2017) Sachin Ravi and Hugo Larochelle. 2017. Optimization as a model for few-shot learning. In International Conference on Learning Representations.
- Ren et al. (2018) Mengye Ren, Eleni Triantafillou, Sachin Ravi, Jake Snell, Kevin Swersky, Joshua B Tenenbaum, Hugo Larochelle, and Richard S Zemel. 2018. Meta-learning for semi-supervised few-shot classification. arXiv preprint arXiv:1803.00676 (2018).
- Ren et al. (2015) Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems. 91–99.
- Rodner and Denzler (2010) Erik Rodner and Joachim Denzler. 2010. One-shot learning of object categories using dependent gaussian processes. In Joint Pattern Recognition Symposium. Springer, 232–241.
- Russakovsky et al. (2015) Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. 2015. Imagenet large scale visual recognition challenge. International journal of computer vision 115, 3 (2015), 211–252.
- Rusu et al. (2019) Andrei A. Rusu, Dushyant Rao, Jakub Sygnowski, Oriol Vinyals, Razvan Pascanu, Simon Osindero, and Raia Hadsell. 2019. Meta-Learning with Latent Embedding Optimization. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. https://openreview.net/forum?id=BJgklhAcK7
- Santoro et al. (2016) Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy Lillicrap. 2016. Meta-learning with memory-augmented neural networks. In International conference on machine learning. 1842–1850.
- Snell et al. (2017) Jake Snell, Kevin Swersky, and Richard Zemel. 2017. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems. 4077–4087.
- Sung et al. (2018) Flood Sung, Yongxin Yang, Li Zhang, Tao Xiang, Philip HS Torr, and Timothy M Hospedales. 2018. Learning to compare: Relation network for few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1199–1208.
- Szegedy et al. (2016) Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. 2016. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition. 2818–2826.
- Vinyals et al. (2016) Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. 2016. Matching networks for one shot learning. In Advances in neural information processing systems. 3630–3638.
- Wang et al. (2018) Yu-Xiong Wang, Ross Girshick, Martial Hebert, and Bharath Hariharan. 2018. Low-shot learning from imaginary data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 7278–7286.
- Wu et al. (2019) Ziyang Wu, Yuwei Li, Lihua Guo, and Kui Jia. 2019. PARN: Position-Aware Relation Networks for Few-Shot Learning. In 2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Korea (South), October 27 - November 2, 2019. IEEE, 6658–6666.
- Ye et al. (2020) Han-Jia Ye, Hexiang Hu, De-Chuan Zhan, and Fei Sha. 2020. Few-Shot Learning via Embedding Adaptation with Set-to-Set Functions. In Computer Vision and Pattern Recognition (CVPR).
- Yu and Aloimonos (2010) Xiaodong Yu and Yiannis Aloimonos. 2010. Attribute-based transfer learning for object categorization with zero/one training example. In European conference on computer vision. Springer, 127–140.
- Zhang et al. (2018) Hang Zhang, Kristin Dana, Jianping Shi, Zhongyue Zhang, Xiaogang Wang, Ambrish Tyagi, and Amit Agrawal. 2018. Context encoding for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 7151–7160.
- Zhang et al. (2019) Hongguang Zhang, Jing Zhang, and Piotr Koniusz. 2019. Few-shot Learning via Saliency-guided Hallucination of Samples. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2770–2779.
- Zhou et al. (2019) Linjun Zhou, Peng Cui, Shiqiang Yang, Wenwu Zhu, and Qi Tian. 2019. Learning to Learn Image Classifiers With Visual Analogy. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 11497–11506.