Few-shot Learning with Multi-scale Self-supervision

by Hongguang Zhang, et al.

Learning concepts from a limited number of datapoints is a challenging task usually addressed by so-called one- or few-shot learning. Recently, an application of second-order pooling in few-shot learning demonstrated superior performance because the aggregation step handles varying image resolutions without the need to modify CNNs to fit specific image sizes, while capturing highly descriptive co-occurrences. However, using a single resolution per image (even if the resolution varies across a dataset) is suboptimal, as the importance of image contents varies across the coarse-to-fine levels depending on the object and its class label, e.g., generic objects and scenes rely on their global appearance while fine-grained objects rely more on their localized texture patterns. Multi-scale representations are popular in image deblurring, super-resolution and image recognition, but they have not been investigated in few-shot learning because its relational nature complicates the use of standard techniques. In this paper, we propose a novel multi-scale relation network based on the properties of second-order pooling to estimate image relations in the few-shot setting. To optimize the model, we leverage a scale selector to re-weight scale-wise representations based on their second-order features. Furthermore, we propose to apply self-supervised scale prediction. Specifically, we leverage an extra discriminator to predict the scale labels and the scale discrepancy between pairs of images. Our model achieves state-of-the-art results on standard few-shot learning datasets.




1 Introduction

CNNs have improved the performance of tasks such as object category recognition, scene classification and fine-grained image recognition. However, such tasks require large amounts of labeled data and time-consuming training.

In contrast, the human brain has the ability to learn and recognize novel objects and complex visual concepts from very few samples, which highlights the superiority of biological vision over artificial CNNs. Inspired by the brain's ability to learn in the few-samples regime, current research efforts go into the so-called problem of few-shot learning, in which networks are trained using only a few training samples. To date, a number of relation-learning deep networks have been proposed [36, 33, 34], which can be viewed as a form of metric learning [39, 22, 14] adapted to the few-shot learning scenario. We take a similar view on the one- and few-shot learning problem in this paper; however, we focus on capturing discriminative multi-scale second-order statistics.

Second-order statistics of feature datapoints have played a pivotal role in advancing the state of the art on several problems in computer vision, including object recognition, texture categorization, action representation, and human tracking, to name a few applications [30, 13, 5, 20]. For example, in the popular region covariance descriptors [35], a covariance matrix, which is computed over multi-modal features from image regions, is used as an object representation for recognition and tracking, and has been extended to several other applications [30, 13]. Recently, second-order representations have been extended to end-to-end CNN learning, where they obtained state-of-the-art results on action recognition, texture classification, scene and fine-grained recognition, and also few-shot learning tasks [17, 21, 42, 40, 43]. Second-order Similarity Network (SoSN) [42] is the first work that proposes to use second-order statistics in the few-shot learning task. Following SoSN, Saliency-guided Hallucination Network [43] and Few-shot Localizer [40] also employ second-order statistics to improve the accuracy, which demonstrates the usefulness of second-order pooling in few-shot learning. In addition to its better representative power, second-order pooling has another advantage over standard CNNs: it factors out the spatial mode of CNN feature maps while preserving the second-order statistics of the signal. Thus, the resulting CNN model is able to handle images of arbitrary resolutions and represent co-occurring patterns.

Figure 1: Illustration of scale mismatch cases in the miniImagenet dataset. The top row shows support samples randomly selected from several episodes; the following rows show the failure cases in those episodes. Red bounding boxes mark the failure cases, which are affected to varying degrees by scale mismatch.

In this paper, we propose a multi-scale self-supervised network based on the above-mentioned properties of second-order pooling. Though multi-scale approaches have been used in low-level vision tasks, high-level tasks such as few-shot learning cannot easily deal with multiple scale-related feature streams due to their pair-wise comparator-like nature. To the best of our knowledge, this is the first work that investigates how to use multi-scale information to model class relations and overcome the scale mismatch failures in few-shot learning demonstrated in Figure 1. To better exploit the scale information, we leverage a Scale Selector (SS) to re-weight scale-wise representations in a soft manner. Moreover, we propose to embed a self-supervised scale predictor and a scale discrepancy predictor into our pipeline to produce discriminative scale-wise features. Specifically, we employ the so-called Scale Classifier (SC) to classify the scale indexes of given features, and the Discrepancy Classifier (DC) to recognize the discrepancy labels of given feature pairs. In this way, our network retains more information about the self-supervisory scale selection task. Somewhat related self-supervision strategies are known to boost results in action recognition.


Our contributions are as follows: we (i) propose a novel multi-scale few-shot learning network, that is, we leverage second-order pooling to capture pair-wise image similarities at multiple scales without increasing the number of networks or network parameters, (ii) propose the self-supervised Scale Selector to re-weight contributions from different scale inputs, and (iii) propose the self-supervisory scale and scale discrepancy terms, e.g., scale and discrepancy discriminators that predict image scales and the scale deviation between image pairs.

As far as we can tell, this is the first work that uses multi-scale representations to estimate ‘scale aware’ image relations in few-shot learning.

2 Related Work

In what follows, we describe recent works on transfer learning, e.g., zero-shot, one-shot and few-shot learning methodologies, as well as related works that use second-order pooling in computer vision.

2.1 Learning From Few Samples

As the performance of CNNs increasingly depends on large-scale datasets, which require heavy manual annotation, researchers have turned to studying how to leverage the knowledge and skills learned from previous tasks so that a machine can understand a new task from very few training samples.

The problem was introduced in 1901 under a notion of “transfer of particle” [41] and is closely related to zero-shot learning [24, 6, 1], which can be defined as an ability to generalize to unseen class categories from categories seen during training. For one- and few-shot learning, some “transfer of particle” is also a desired mechanism, as generalizing from one or few datapoints to account for the intra-class variability of thousands of images is a formidable task.

Figure 2: The network architecture of the Multi-scale Second-order Relation Network (MsSoSN). The first-order convolutional features from lower scales are passed to upper levels via an upsample operation, then second-order pooling is applied to the convolutional feature vectors (stacked into a matrix) to produce the final multi-scale second-order representations. Thanks to second-order pooling, we use one Relation Network across all scale levels, which means the overall number of parameters of MsSoSN is the same as for the backbone model SoSN [42]. We employ a scale-wise multi-level Mean Square Error (MSE) objective to train our network. Three additional components are used: (i) the shared Scale Selector (SS), which is used to re-weight multi-scale second-order features based on input information; (ii) the Scale Classifier (SC); and (iii) the scale Discrepancy Classifier (DC), both helping produce more discriminative multi-scale representations via self-learning. Note that each relation score in the figure relates the i-th support sample to the j-th query sample.

One- and Few-shot Learning has been studied widely in computer vision in both shallow [27, 26, 8, 3, 7, 23] and deep learning scenarios [16, 36, 33, 9, 34, 10, 32, 11, 42, 43, 40, 15, 12].

Early works [7, 23] propose one-shot learning methods motivated by the observation that humans can learn new concepts from very few samples. These two papers employ a generative model with an iterative inference for transfer. Siamese Network [16] presents a two-stream convolutional neural network approach which generates image descriptors and learns the relation between them. Matching Network [36] introduces the concept of the support set and the $L$-way $Z$-shot learning protocols. It captures the similarity between one testing image and several support images, thus casting the one-shot learning problem as set-to-set learning. Prototypical Networks [33] learn a model that computes distances between a datapoint and prototype representations of each class. Model-Agnostic Meta-Learning (MAML) [9] is trained on a variety of different learning tasks. Relation Net [34] is an effective end-to-end network for learning the relationship between testing and support images. Conceptually, this model is similar to Matching Network [36]. However, Relation Net leverages an additional deep neural network, the so-called Similarity Network, which learns the similarity on top of the image descriptor generating network and produces so-called relation scores. SoSN [42] is somewhat similar to Relation Net [34], but it investigates second-order representations to capture co-occurrences of features and Power Normalizing functions whose role is to aggregate feature vectors. SalNet [43] proposes an efficient saliency-guided end-to-end sample hallucination strategy to generate richer representations via pairing foregrounds with different backgrounds. Graph Neural Networks (GNN) have also been applied to few-shot learning [10, 15, 12] with good results.

Second-order Statistics have been used in texture recognition [35] via so-called Region Covariance Descriptors (RCD) and in tracking [30] and object category recognition [20, 21]. Higher-order statistics have also been used in action classification from the body skeleton sequences [17] and domain adaptation [18]. Recently, second-order pooling has also benefited few-shot learning [42, 43, 40].

3 Background

Below we detail our notations and explain the process of computing second-order representations.

3.1 Notations

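Let $\mathbf{x}\in\mathbb{R}^{d}$ be a $d$-dimensional feature vector. $\mathcal{I}_N$ stands for the index set $\{1, 2, \ldots, N\}$. Then we use $\boldsymbol{\mathcal{X}} = \otimes_r\,\mathbf{x}$ to denote the $r$-mode super-symmetric rank-one tensor generated by the $r$-th order outer-product of $\mathbf{x}$, where the element of $\boldsymbol{\mathcal{X}}$ at the $(i_1, i_2, \ldots, i_r)$-th index is given by $\prod_{j=1}^{r} x_{i_j}$. Operator $\mathrm{vec}(\cdot)$ denotes vectorisation of a matrix or tensor. Typically, capitalised boldface symbols such as $\boldsymbol{\Phi}$ denote matrices, lowercase boldface symbols such as $\boldsymbol{\phi}$ denote vectors, and regular-case symbols such as $\Phi_{ij}$, $\phi_i$, $n$ or $Z$ denote scalars, e.g., $\Phi_{ij}$ is the $(i,j)$-th coefficient of $\boldsymbol{\Phi}$. Finally, $\delta(x-y)=1$ if $x=y$ and 0 otherwise.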

3.2 Second- and High-order Tensors

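Below we show that second- or higher-order tensors emerge from a linearization of the sum of Polynomial kernels.

Proposition 1

Let $\boldsymbol{\Phi}_A = \{\boldsymbol{\phi}_n\}_{n\in\mathcal{I}_N}$ and $\boldsymbol{\Phi}_B = \{\boldsymbol{\phi}^*_n\}_{n\in\mathcal{I}_{N^*}}$ be datapoints from two images $A$ and $B$, and let $N$ and $N^*$ be the numbers of data vectors, e.g., obtained from the last convolutional feature map of a CNN for images $A$ and $B$. Tensor feature maps result from a linearization of the sum of Polynomial kernels of degree $r$:

$$\frac{1}{NN^*}\sum_{n\in\mathcal{I}_N}\sum_{n'\in\mathcal{I}_{N^*}} \left\langle \boldsymbol{\phi}_n, \boldsymbol{\phi}^*_{n'} \right\rangle^{r} = \left\langle \frac{1}{N}\sum_{n\in\mathcal{I}_N} \otimes_r\,\boldsymbol{\phi}_n,\; \frac{1}{N^*}\sum_{n'\in\mathcal{I}_{N^*}} \otimes_r\,\boldsymbol{\phi}^*_{n'} \right\rangle. \qquad (1)$$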
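The linearization stated in Proposition 1 can be checked numerically. The sketch below is only an illustration under assumptions: the function names are hypothetical, the features are random, and the averaging over the numbers of vectors is one common normalisation. It compares the average of degree-r polynomial kernels over all pairs of feature vectors with the inner product of the two averaged r-th order outer-product tensors.

```python
import numpy as np

def tensor_feature_map(phi, r):
    """Average of the r-th order outer products of the columns of phi (shape d x N)."""
    d, n = phi.shape
    acc = np.zeros((d,) * r)
    for col in phi.T:
        outer = col
        for _ in range(r - 1):
            outer = np.multiply.outer(outer, col)  # raise the tensor order by one
        acc += outer
    return acc / n

def mean_polynomial_kernel(phi_a, phi_b, r):
    """Average of <phi_n, phi*_n'>^r over all pairs of columns of the two inputs."""
    return np.mean((phi_a.T @ phi_b) ** r)

rng = np.random.default_rng(0)
phi_a = rng.standard_normal((4, 5))  # d = 4, five feature vectors (e.g. a small image)
phi_b = rng.standard_normal((4, 7))  # same d, seven feature vectors (a larger image)

for r in (2, 3):
    lhs = np.sum(tensor_feature_map(phi_a, r) * tensor_feature_map(phi_b, r))
    rhs = mean_polynomial_kernel(phi_a, phi_b, r)
    assert np.isclose(lhs, rhs)
```

Note that the two inputs hold different numbers of feature vectors, yet both tensor feature maps have the same size, which is the property the multi-scale design relies on.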

Remark 1

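In what follows, we will use second-order matrices obtained from the above expansion for $r=2$, built from datapoints $\boldsymbol{\phi}$ and $\boldsymbol{\phi}^*$ which are partially shifted by their means $\boldsymbol{\mu}$ and $\boldsymbol{\mu}^*$ so that $\boldsymbol{\phi} := \boldsymbol{\phi} - \beta\boldsymbol{\mu}$, $\boldsymbol{\phi}^* := \boldsymbol{\phi}^* - \beta\boldsymbol{\mu}^*$, and $0 \le \beta \le 1$, to account for so-called negative visual words, which are the evidence of the lack of a given visual stimulus in an image [21]. We define a (kernel) feature map (note that (kernel) feature maps are not convolutional CNN feature maps) with a Power Normalizing (PN) operator $g$:

$$\boldsymbol{\Psi}(\boldsymbol{\Phi}) = g\!\left(\frac{1}{N}\sum_{n\in\mathcal{I}_N} \boldsymbol{\phi}_n\boldsymbol{\phi}_n^{\top}\right) = g\!\left(\frac{1}{N}\boldsymbol{\Phi}\boldsymbol{\Phi}^{\top}\right), \qquad (2)$$

where the columns of $\boldsymbol{\Phi}$ are the feature vectors $\boldsymbol{\phi}_n$. In this work, $g$ is a zero-centered Sigmoid function:

$$g(\mathbf{M}) = \frac{1 - e^{-\sigma\mathbf{M}}}{1 + e^{-\sigma\mathbf{M}}}, \qquad (3)$$

where $\sigma$ is the factor that controls the slope of the PN function, and all operations on the matrix $\mathbf{M}$ are element-wise.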

From Eq. (2), one can see that second-order pooling factors out the spatial mode of the column feature vectors stacked in the feature matrix $\boldsymbol{\Phi}$. Thus, images of various spatial sizes may produce various numbers of feature vectors, which are pooled into a matrix of a constant size independent of the scale, while rich co-occurrences of patterns are captured. Second-order pooling therefore lends itself to multi-scale problems, e.g., we use one Relation Network irrespective of the scale (alternatively, one can imagine that network parameters are shared between three instances of the Relation Network).
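The size-independence argument can be illustrated with a minimal NumPy sketch. It is an assumption-laden illustration rather than the authors' implementation: the function names are hypothetical, the partial mean shift is omitted, and the zero-centered sigmoid is written as tanh(0.5 * sigma * m), which is algebraically equal to (1 - exp(-sigma * m)) / (1 + exp(-sigma * m)).

```python
import numpy as np

def power_normalize(m, sigma=1.0):
    """Element-wise zero-centered sigmoid; sigma controls the slope."""
    # tanh(x / 2) == (1 - exp(-x)) / (1 + exp(-x)), and it is numerically stable.
    return np.tanh(0.5 * sigma * m)

def second_order_pool(feature_map, sigma=1.0):
    """Map a (d, H, W) convolutional feature map to a (d, d) second-order matrix."""
    d = feature_map.shape[0]
    phi = feature_map.reshape(d, -1)  # stack the H*W feature vectors as columns
    n = phi.shape[1]
    return power_normalize(phi @ phi.T / n, sigma)

rng = np.random.default_rng(0)
# Three hypothetical feature maps of one image at three scales (d = 64 channels).
maps = [rng.standard_normal((64, side, side)) for side in (4, 8, 16)]
pooled = [second_order_pool(m) for m in maps]
assert all(p.shape == (64, 64) for p in pooled)
```

Whatever the spatial size of the input, the pooled matrix is d x d, so a single comparator network can consume the representation of every scale.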

Figure 3: The architecture of Feature Encoder (Conv-4-64) and Similarity Network.

4 Approach

Although the use of multi-scale representations has been studied in low-level vision tasks, it has not been used in few-shot learning due to the difficulty posed when pairs of images need to be compared, e.g., it is not obvious how to compare several feature sets formed from images at different resolutions. In conventional image classification, researchers generally believe that low-resolution images capture a subset of the information contained in their high-resolution counterparts. However, extracting the discriminative information from images may depend on the most expressive scale, which differs from image to image. When learning to compare pairs of images, which is the main working mechanism behind relation-based few-shot learning, one has to match features between the same object represented at different scales in the pair under comparison.

Inspired by the diversity of multi-scale object relations and the scale-mismatch problem in current few-shot learning methods, we propose to apply a multi-scale strategy to several classic pipelines, namely Multi-scale Prototypical Net (MsPN), Multi-scale Relation Net (MsRN) and Multi-scale Second-order Similarity Network (MsSoSN), to verify our assumptions. Our proposed methods have been evaluated on the most popular publicly available few-shot learning datasets (not to be confused with any-shot or image classification datasets), on which we achieve state-of-the-art accuracy.

4.1 Pipeline

In what follows, we take MsSoSN, shown in Figure 2, as an example to illustrate our pipeline.

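Let $\mathbf{X}^{s}$ denote the image input at a given scale level $s$. Specifically, $\mathbf{X}^{1}$, $\mathbf{X}^{2}$ and $\mathbf{X}^{3}$ correspond to the three input resolutions. Moreover, we have:

$$\boldsymbol{\Phi}^{s} = f(\mathbf{X}^{s}; \mathcal{F}),$$

where $f$ denotes the feature encoding network, $\mathcal{F}$ refers to the parameters of the encoder network, and $\boldsymbol{\Phi}^{s}$ are the corresponding convolutional features at the scale level denoted by index $s$.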

Subsequently, we apply the second-order pooling of Eq. (2) to the convolutional features at each scale, which yields one second-order representation per scale.


Now, we are ready to pass the second-order representations $\boldsymbol{\Psi}^{s}$ to the Relation Network to model image relations at multiple scales. For the $L$-way 1-shot problem, we assume one support image $\mathbf{X}$ with its image descriptor $\boldsymbol{\Psi}$ and one query image $\mathbf{X}^{*}$ with its image descriptor $\boldsymbol{\Psi}^{*}$. In general, we use ‘$*$’ to indicate query-related variables. Moreover, each of the above descriptors belongs to one of $L$ classes in the subset $\{c_1, \ldots, c_L\} \subset \mathcal{I}_C$ that forms the so-called $L$-way learning problem, and the class subset is chosen at random from $\mathcal{I}_C$. Then, the $L$-way 1-shot multi-scale learning becomes a similarity learning:

$$\zeta^{ss'} = r\!\left(\vartheta\!\left(\boldsymbol{\Psi}^{s}, \boldsymbol{\Psi}^{*s'}\right); \mathcal{R}\right),$$

where $\zeta^{ss'}$ is the relation score for a given pair of the support image at scale $s$ and the query image at scale $s'$, $r$ refers to the Relation Network, and $\mathcal{R}$ denotes the network parameters that have to be learnt. $\vartheta$ is the descriptor/operator on features of image pairs, and in this paper it is a simple ‘concatenation’ along the third mode, which yields a three-mode tensor.
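The pairing step can be sketched as follows. This is a toy illustration under stated assumptions: `toy_relation_network` is a hypothetical linear-plus-sigmoid stand-in for the learnt Relation Network, and the `cross_ref` flag is an assumed name for switching between same-scale pairs and all cross-scale pairs. The one point carried over from the text is that the pair descriptor is a concatenation along the third mode and that the same network weights serve every scale.

```python
import numpy as np

def pair_descriptor(psi_support, psi_query):
    """Concatenate two (d, d) second-order matrices along a third mode -> (d, d, 2)."""
    return np.stack([psi_support, psi_query], axis=-1)

def toy_relation_network(pair, weights):
    """Hypothetical stand-in for the Relation Network: a linear map plus a sigmoid."""
    logit = np.sum(pair * weights)
    return 1.0 / (1.0 + np.exp(-logit))

def relation_scores(support_by_scale, query_by_scale, weights, cross_ref=False):
    """Return {(s, s'): score}; only same-scale pairs unless cross_ref is True."""
    scores = {}
    for s, psi_s in enumerate(support_by_scale, start=1):
        for t, psi_q in enumerate(query_by_scale, start=1):
            if cross_ref or s == t:
                pair = pair_descriptor(psi_s, psi_q)
                scores[(s, t)] = toy_relation_network(pair, weights)  # shared weights
    return scores

rng = np.random.default_rng(0)
support = [rng.standard_normal((8, 8)) for _ in range(3)]  # one matrix per scale
query = [rng.standard_normal((8, 8)) for _ in range(3)]
weights = rng.standard_normal((8, 8, 2)) * 0.01

same_scale = relation_scores(support, query, weights)
all_pairs = relation_scores(support, query, weights, cross_ref=True)
assert len(same_scale) == 3 and len(all_pairs) == 9
```

Because every second-order matrix has the same size, no per-scale branch is needed; the parameter count does not grow with the number of scales.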

Normally, we only measure the relations between support and query images from the same scale. However, to address the scale mismatch problem mentioned above more effectively, we also investigate how to measure the similarities between images from different scales via a Cross Reference (CrossRef) mechanism. In this way, the representations of objects of interest can be shifted to the best matching pairs of scales, which helps to remove the scale mismatch and results in more accurate object relations.

The objective of MsSoSN is a scale-wise Mean Square Error (MSE) over the relation scores. It accumulates the error over the input images at every scale index up to the largest one (three scales in our work), over the index $i$ of support samples and over the index $j$ of query samples.
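A scale-wise MSE objective of this kind might be sketched as below. The sketch makes assumptions that the text does not confirm: the regression target is 1 for a support-query pair of the same class and 0 otherwise (the convention of Relation Net-style models), all scales are weighted equally, and the errors are summed rather than averaged. The function name is hypothetical.

```python
import numpy as np

def multi_scale_mse(scores, support_labels, query_labels):
    """Sum of squared errors over scales and support-query pairs.

    scores: array of shape (S, n_support, n_query) with relation scores per scale.
    The target is 1 when the support and query labels match, else 0 (assumption).
    """
    support_labels = np.asarray(support_labels)
    query_labels = np.asarray(query_labels)
    targets = (support_labels[:, None] == query_labels[None, :]).astype(float)
    return np.sum((scores - targets[None, :, :]) ** 2)

support_labels = [0, 1]
query_labels = [0, 1, 1]
perfect = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]])
scores = np.repeat(perfect[None], 3, axis=0)  # identical, perfect scores at 3 scales
assert multi_scale_mse(scores, support_labels, query_labels) == 0.0
```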

4.2 Scale Selector

As our Feature Encoder processes images of different scales, we want to select the dominant, most discriminative scale for each given support-query pair of images; thus, we leverage an attention module to re-weight the representations of each scale level. We propose the so-called Scale Selector (SS), which is a gated attention module with the ability to learn and select scales for each image pair that is scored for similarity.

To be more precise, as different visual concepts may be expressed by their constituent parts, each appearing at a different scale, we allow a soft attention which selects a Mixture of Dominant Scales (MDS). Moreover, as second-order representations are used as inputs to our attention network, the network selects the MDS for co-occurring features (which may correspond to image parts). We re-define the objective accordingly: the Scale Selector, which has its own network parameters, re-weights the scale-wise terms for each pair given by the index of a support sample and the index of a query sample.

Figure 4: The network architecture of the Scale Selector. Given second-order features, we apply two pooling layers with kernel size 8 to contract the features to a scalar.
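The contraction described in the caption can be illustrated with a short sketch. Several details are assumptions made only for illustration: the second-order matrix is taken to be 64 x 64 (so that two kernel-8 pooling steps reduce it to a single value), the pooling is average pooling, and the gate is a sigmoid with hypothetical `gain` and `bias` parameters standing in for learnt weights.

```python
import numpy as np

def avg_pool2d(m, k):
    """Non-overlapping k x k average pooling of a 2-D array whose sides are multiples of k."""
    h, w = m.shape
    return m.reshape(h // k, k, w // k, k).mean(axis=(1, 3))

def scale_weight(psi, gain=1.0, bias=0.0):
    """Contract a 64 x 64 second-order matrix to one scalar gate in (0, 1)."""
    pooled = avg_pool2d(avg_pool2d(psi, 8), 8)  # (64, 64) -> (8, 8) -> (1, 1)
    return 1.0 / (1.0 + np.exp(-(gain * pooled.item() + bias)))

def reweight_scales(psis, gain=1.0, bias=0.0):
    """Apply the same (shared) selector to every scale and re-weight each representation."""
    weights = np.array([scale_weight(p, gain, bias) for p in psis])
    return [w * p for w, p in zip(weights, psis)], weights

rng = np.random.default_rng(0)
psis = [rng.standard_normal((64, 64)) for _ in range(3)]  # one matrix per scale
reweighted, weights = reweight_scales(psis)
assert len(reweighted) == 3 and weights.shape == (3,)
```

Soft weights of this kind let several scales contribute to one pair at once, which matches the idea of selecting a mixture of dominant scales instead of a single scale.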

As the $\ell_1$ norm, a potential sparse regularizer for SS, does not take the class relations between support and query samples into account, we propose a simple regularization term that forces pairs of samples from the same class to follow a similar MDS distribution. For pairs of samples from different classes, the distributions are encouraged to differ.


4.3 Self-supervised Scale and Scale Discrepancy

4.3.1 Scale Discriminator.

To produce more discriminative multi-scale representations, we employ self-supervision and design an MLP-based Scale Discriminator (SD) which recognizes the scales of training images. The network architecture of SD is shown in Figure 5.

Figure 5: Scale Discriminator with 3 fully-connected layers.

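Specifically, we feed second-order representations to the SD module and assign labels 1, 2 or 3 to images at the three respective resolutions. We apply the cross-entropy loss to train the SD module to classify the scale corresponding to a given second-order feature matrix.

Given $\boldsymbol{\Psi}^{s}_{i}$, which is the second-order representation of $\mathbf{X}^{s}_{i}$, we vectorize it via $\mathrm{vec}(\cdot)$ and forward it to the SD module to predict the scale index $s$. We have:

$$\mathbf{p}^{s}_{i} = d\!\left(\mathrm{vec}(\boldsymbol{\Psi}^{s}_{i}); \mathcal{D}\right),$$

where $d$ refers to the scale discriminator, $\mathcal{D}$ are the parameters of $d$, and $\mathbf{p}^{s}_{i}$ are the scale prediction scores for $\boldsymbol{\Psi}^{s}_{i}$. We go over all $i$ corresponding to support and query images in the mini-batch and we use cross-entropy to learn the parameters of the Scale Discriminator:

$$L_{sd} = -\sum_{i}\sum_{s} \log \frac{\exp\!\left(p^{s}_{i}[s]\right)}{\sum_{s'} \exp\!\left(p^{s}_{i}[s']\right)},$$

where $s'$ enumerates over scale indexes.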
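The self-supervised scale classification loss is a standard softmax cross-entropy whose labels come for free from the resizing step. A minimal sketch, with hypothetical function names and averaged rather than summed losses (an assumption), is:

```python
import numpy as np

def log_softmax(logits):
    """Row-wise log-softmax with the usual max-shift for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def scale_cross_entropy(logits, scale_labels):
    """Mean cross-entropy; logits has shape (n, S) and scale_labels holds values in 1..S."""
    logp = log_softmax(np.asarray(logits, dtype=float))
    idx = np.asarray(scale_labels) - 1  # labels 1, 2, 3 -> columns 0, 1, 2
    return -logp[np.arange(len(idx)), idx].mean()

# Scores of a hypothetical discriminator for four second-order matrices.
logits = np.array([[4.0, 0.0, 0.0],
                   [0.0, 4.0, 0.0],
                   [0.0, 0.0, 4.0],
                   [4.0, 0.0, 0.0]])
loss_right = scale_cross_entropy(logits, [1, 2, 3, 1])
loss_wrong = scale_cross_entropy(logits, [2, 3, 1, 2])
assert loss_right < loss_wrong
```

No manual annotation is involved: the scale label of each sample is known because the multi-scale inputs are generated by resizing the same image.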

4.3.2 Discrepancy Discriminator.

As relation learning requires comparing pairs of images, we propose to model the scale discrepancy between each support-query pair by assigning a discrepancy label to each pair. Specifically, the label is determined by the scales $s$ and $s'$ of a given support-query pair. Then we train the so-called Discrepancy Discriminator (DD) to recognize the discrepancy between scales. DD uses the same architecture as the Similarity Network in Figure 3, as the inputs to DD are support-query matrix pairs, thus:

$$\mathbf{p}^{ss'}_{ij} = d'\!\left(\vartheta\!\left(\boldsymbol{\Psi}^{s}_{i}, \boldsymbol{\Psi}^{*s'}_{j}\right); \mathcal{D}'\right),$$

where $d'$ refers to the scale discrepancy discriminator, $\mathcal{D}'$ are the parameters of $d'$, $\mathbf{p}^{ss'}_{ij}$ are the scale discrepancy prediction scores, and $\vartheta$ is concatenation in mode 3. We go over all support and query image indexes $i$ and $j$ in the mini-batch and apply the cross-entropy loss to learn the discrepancy labels, enumerating over the scale indexes $s$ and $s'$.
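One possible way to turn a pair of scale indexes into a classification label is sketched below. The mapping is an assumption chosen for illustration (the signed difference of the two indexes, offset so that labels start at zero); the text only states that the label depends on the two scales of the pair. The function names are hypothetical.

```python
from itertools import product

def discrepancy_label(s, s_prime, num_scales=3):
    """Map the signed scale difference s - s' to a class index in 0..2*(num_scales - 1)."""
    return (s - s_prime) + (num_scales - 1)

def all_pair_labels(num_scales=3):
    """Discrepancy label for every (support scale, query scale) combination."""
    scales = range(1, num_scales + 1)
    return {(s, t): discrepancy_label(s, t, num_scales)
            for s, t in product(scales, repeat=2)}

labels = all_pair_labels()
assert len(labels) == 9                 # 3 x 3 scale pairs
assert len(set(labels.values())) == 5   # differences -2, -1, 0, +1, +2
```

Under this assumed mapping, pairs at the same scale share one label regardless of which scale it is, so the classifier has to recognize the relative, not the absolute, scale of the two inputs.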

Final Loss. The total loss combines the objective of MsSoSN with the terms introduced by the proposed Scale Selector, Scale Discriminator and Discrepancy Discriminator. Scalar hyper-parameters control the impact of the regularization and of each individual loss component.
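One way to write such a weighted combination is sketched below. The symbols $L_{\text{rel}}$ (relation objective), $L_{\text{reg}}$ (Scale Selector regularization), $L_{\text{sd}}$, $L_{\text{dd}}$ and the weights $\lambda$, $\alpha$, $\beta$ are placeholder names assumed for this sketch, not notation confirmed by the text.

```latex
L_{\text{total}} = L_{\text{rel}} + \lambda\, L_{\text{reg}} + \alpha\, L_{\text{sd}} + \beta\, L_{\text{dd}}
```

Setting a weight to zero switches the corresponding component off, which mirrors the ablated variants such as MsSoSN+SS or MsSoSN+SD.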

Model                        Backbone     1-shot   5-shot
Matching Nets [36]           -
Prototypical Net [33]        Conv-4-64
MAML [9]                     Conv-4-64
Relation Net [34]            Conv-4-64
GNN [10]                     Conv-4-64
SoSN [42]                    Conv-4-64
MAML++ [2]                   Conv-4-64
MetaOpt [25]                 Conv-4-64
SalNet [43]                  Conv-4-64
SoSN [42]                    ResNet-12
TADAM [29]                   ResNet-12
LwoF [11]                    WRN-28-10
MsSoSN                       Conv-4-64
MsSoSN+SS                    Conv-4-64
MsSoSN+SD                    Conv-4-64
MsSoSN+DD                    Conv-4-64
MsSoSN+SS+SD+DD              Conv-4-64
MsSoSN(CrossRef)             Conv-4-64
MsSoSN(CrossRef)+SS+SD+DD    Conv-4-64

Table 1: Evaluations on the miniImagenet dataset (5-way acc. given). MsSoSN uses multi-scale images at three resolutions.

                         Flower-102        CUB-200-2011      Food-101
Model                    1-shot  5-shot    1-shot  5-shot    1-shot  5-shot
Prototypical Net [33]
Relation Net [34]
SoSN [42]
SoSN
SoSN
MsSoSN
MsSoSN(CrossRef)
MsSoSN+SS
MsSoSN+SD
MsSoSN+DD
MsSoSN+SS+SD+DD

Table 2: Evaluations on three fine-grained classification datasets, Flower-102, CUB-200-2011 and Food-101 (5-way acc. given). Refer to [34, 42] for details of the baselines listed in this table.

Model                    1-shot   5-shot
Prototypical Net [33]
MsPN-2 Scale
MsPN-3 Scale
Relation Net [34]
MsRN-2 Scale
MsRN-3 Scale
SoSN [42]
MsSoSN-2 Scale
MsSoSN-3 Scale

Table 3: Ablation study – the effect of multi-scale design on three different baseline models and the performance of different scale numbers on the accuracy on the miniImagenet dataset (5-way acc. given).

5 Experiments

Model                way
Relation Net [34]    5
SoSN [42]            5
SoSN                 5
MsSoSN               5
MsSoSN(CrossRef)     5
Relation Net [34]    20
SoSN [42]            20
SoSN                 20
MsSoSN               20
Relation Net [34]    30
SoSN [42]            30
SoSN                 30
MsSoSN               30

p1: shn+hon+clv, p2: clk+gls+scl, p3: sci+nat, p4: shx+rlc. Notation x→y means training on exhibition x and testing on y.

Table 4: Evaluations on the Open MIC dataset (Protocol I) (given 1-shot learning accuracies). (http://claret.wikidot.com).

Below we demonstrate the usefulness of our proposed Multi-scale Second-order Relation Network. Our method is evaluated on the miniImagenet [36] and tiered–Imagenet [31] datasets, the recent Open MIC dataset [19], and three fine-grained classification datasets in the one- and few-shot learning setting. The network is trained with the Adam solver. The layer configurations of our proposed MsSoSN model, Scale Selector and Scale Discriminator are shown in Figures 3, 4, and 5. To run the multi-scale pipeline, we use images at three resolutions. The results are compared against several state-of-the-art methods for one- and few-shot learning.

5.1 Datasets

Below, we describe our experimental setup, datasets and evaluations.

miniImagenet [36] consists of 60000 RGB images from 100 classes. We follow the standard protocol, which uses 64 classes for training, 16 classes for validation and the remaining 20 classes for testing.

tiered–Imagenet [31] consists of 608 classes from ImageNet. We follow the protocol that uses 351 base classes, 97 validation classes and 160 novel test classes.

Open MIC is the Open Museum Identification Challenge (Open MIC) [19], a recent dataset with photos of various museum exhibits, e.g., paintings, timepieces, sculptures, glassware, relics, science exhibits, natural history pieces, ceramics, pottery, tools and indigenous crafts, captured in 10 museum spaces, according to which this dataset is divided into 10 subproblems. In total, it has 866 diverse classes and 1–20 images per class. We combine (shn+hon+clv), (clk+gls+scl), (sci+nat) and (shx+rlc) into subproblems p1, ..., p4. We form 12 possible pairs in which subproblem x is used for training and y for testing (x→y).
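The count of 12 train/test pairs follows from taking every ordered pair of distinct subproblems out of four (4 × 3 = 12). A short sketch that enumerates them, using the grouping given above (the variable names are illustrative):

```python
from itertools import permutations

# Grouping of the ten exhibition subsets into four subproblems, as listed in the text.
subproblems = {
    "p1": ("shn", "hon", "clv"),
    "p2": ("clk", "gls", "scl"),
    "p3": ("sci", "nat"),
    "p4": ("shx", "rlc"),
}

# Ordered pairs (x, y): train on subproblem x, test on subproblem y, with x != y.
pairs = list(permutations(subproblems, 2))
names = [f"{x}→{y}" for x, y in pairs]

assert len(pairs) == 12
assert sum(len(v) for v in subproblems.values()) == 10  # the ten museum spaces
```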

Fine-grained Datasets.

Flower-102 [28], a fine-grained category recognition dataset, contains 102 classes of various flowers. Each class consists of 40-258 images. We randomly select 80 classes for training and 22 classes for testing.

Caltech-UCSD-Birds 200-2011 (CUB-200-2011) [37] has 11788 images for 200 bird species. 150 classes are randomly selected for training and the rest for testing.

Food-101 [4] has 101000 images in total and 1000 images per category. We choose 80 classes for training and 21 classes for testing.

Model                    1-shot   5-shot
MAML [9]
Prototypical Net [33]
Relation Net [34]
SoSN [42]
MsSoSN
MsSoSN+SS+SD+DD

Table 5: Top-1 accuracies on the novel test classes of the tiered–Imagenet dataset (5-way acc. given).

5.2 Performance Analysis

The evaluation results on miniImagenet [36] are given in Table 1. Our approach achieves state-of-the-art performance among all methods based on the ‘Conv-4-64’ backbone. We note that using larger image inputs helps improve the accuracy of few-shot classification: an improvement is obtained simply by increasing the input image resolution. Furthermore, once we feed our MsSoSN with images generated at three scales, the performance on the 5-way 1- and 5-shot protocols increases further without any modification to our network architecture or increase in the number of network parameters, which demonstrates the effectiveness of our multi-scale learning strategy in few-shot learning. On tiered–Imagenet, our proposed model also improves both the 1-shot and 5-shot protocols. Kindly refer to Table 5 for more state-of-the-art results and baselines, which are ‘Conv-4-64’-based approaches.

Our proposed network has also been evaluated on the Open MIC dataset [19] and classical fine-grained image classification datasets, e.g., Flower-102, CUB-200-2011 and Food-101, in the few-shot setting. From Tables 2 and 4, one can see that our proposed method performs progressively better as we add the Scale Selector (SS), Scale Discriminator (SD) and Discrepancy Discriminator (DD). For instance, on Flower-102, one can see a 1% gain by adding SS to MsSoSN, a 0.8% gain by adding SD to MsSoSN, and a 1.4% gain by adding DD to MsSoSN, while the combination of MsSoSN and SS+SD+DD yields a 3% improvement over MsSoSN and an 8.4% improvement over SoSN.

Ablation Experiments. We compare the impact of the number of scales on the accuracy in Table 3. Combining information from multiple scales improves the accuracy over simply using single-scale inputs for all of Prototypical Net, Relation Net and SoSN by up to 2-3.5%. The highest input resolution lets us perform several downsampling steps without reducing images to unreasonably small resolutions. We investigate the 2-scale combination and the 3-scale combination in MsPN, MsRN and MsSoSN. Taking MsSoSN as an example, the 2-scale network outperforms the single-scale network for both 1-shot and 5-shot learning. When we increase the number of scales to 3, the 1-shot and 5-shot results are further improved by 0.6%.

Cross Reference over Scales. We use the Cross Reference (CrossRef) mechanism to address the scale-mismatch cases in several datasets. Table 1 shows that CrossRef effectively improves the performance on miniImagenet, which is consistent with our observation that scale mismatch is a major reason for the misclassified cases. However, according to Table 2, this strategy works less robustly on the fine-grained datasets: the scales of images in these datasets are well aligned, so CrossRef can even lead to a performance drop.

Scale Selector. The objective of the Scale Selector is to help the network learn how to select scales based on the second-order representations. We observe in our experiments that using the Scale Selector further improves results for the 1-shot and 5-shot protocols on miniImagenet. The results on Flower-102, Food-101 and Open MIC also improve with the use of SS.

Self-supervised Scale and Discrepancy Discriminators. Self-supervisory cues are known to boost the performance of image recognition models due to the additional regularization they provide. Applying a Scale Discriminator to perform self-supervised learning appears to be an easy and obvious choice for our multi-scale model, as it has barely any impact on the network complexity or training times. According to our evaluations, the Scale Discriminator improves both the 1-shot and the 5-shot accuracy on the miniImagenet dataset. The Discrepancy Discriminator works in a similar manner by introducing self-supervisory cues; however, it learns to recognize the relative scale labels between support-query image pairs, thus capturing the scale variance. Table 1 shows that applying the Discrepancy Discriminator also improves the 1-shot and 5-shot accuracy. Figure 6 shows the impact of the hyper-parameters on the accuracy.

Finally, without any pre-training, MsSoSN with the Scale Selector, Scale Discriminator and Discrepancy Discriminator outperforms all state-of-the-art methods based on the ‘Conv-4-64’ backbone on both the 5-way 1-shot and 5-shot protocols of miniImagenet and tiered–Imagenet.

Figure 6: The influence of the loss hyper-parameters on the accuracy on miniImagenet (5-way top-1 acc. given).

6 Conclusions

In this paper, we propose a novel Multi-scale Second-order Relation Network that benefits from similarity learning at multiple scales. We propose a Scale Selector in the network to re-weight scale representations in a self-learning manner. Furthermore, we investigate how to leverage self-supervised learning in the few-shot learning scenario by the use of so-called Scale and Discrepancy Discriminators, which provide easily obtainable self-supervisory cues that regularize our network, thus helping it generalize better. Our experiments demonstrate the usefulness of the proposed Multi-scale Second-order Relation Network in capturing accurate image relations. Note that our ‘Conv-4’-based MsSoSN outperforms not only the ‘Conv-4’-based methods, but also other state-of-the-art models built upon much deeper backbones, e.g., ResNet-12 and WRN-28-10, at a much lower training cost.

Acknowledgements. This research is supported by the China Scholarship Council (CSC Student ID 201603170283). We also thank CSIRO Scientific Computing, NVIDIA (GPU grant) and National University of Defense Technology for their support.


  • [1] Z. Akata, F. Perronnin, Z. Harchaoui, and C. Schmid (2013) Label-embedding for attribute-based classification. CVPR, pp. 819–826. Cited by: §2.1.
  • [2] A. Antoniou, H. Edwards, and A. Storkey (2018) How to train your maml. arXiv preprint arXiv:1810.09502. Cited by: Table 1.
  • [3] E. Bart and S. Ullman (2005) Cross-generalization: learning novel classes from a single example by feature replacement.. CVPR, pp. 672–679. External Links: ISBN 0-7695-2372-2 Cited by: §2.1.
  • [4] L. Bossard, M. Guillaumin, and L. Van Gool (2014) Food-101 – mining discriminative components with random forests. In European Conference on Computer Vision.
  • [5] J. Carreira, R. Caseiro, J. Batista, and C. Sminchisescu (2012) Semantic Segmentation with Second-Order Pooling. ECCV.
  • [6] A. Farhadi, I. Endres, D. Hoiem, and D. Forsyth (2009) Describing objects by their attributes. CVPR, pp. 1778–1785.
  • [7] L. Fei-Fei, R. Fergus, and P. Perona (2006) One-shot learning of object categories. PAMI 28 (4), pp. 594–611.
  • [8] M. Fink (2005) Object classification from a single example utilizing class relevance metrics. NIPS, pp. 449–456.
  • [9] C. Finn, P. Abbeel, and S. Levine (2017) Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, pp. 1126–1135.
  • [10] V. Garcia and J. Bruna (2017) Few-shot learning with graph neural networks. arXiv preprint arXiv:1711.04043.
  • [11] S. Gidaris and N. Komodakis (2018) Dynamic few-shot visual learning without forgetting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4367–4375.
  • [12] S. Gidaris and N. Komodakis (2019) Generating classification weights with GNN denoising autoencoders for few-shot learning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [13] K. Guo, P. Ishwar, and J. Konrad (2013) Action recognition from video using feature covariance matrices. Trans. Img. Proc. 22 (6), pp. 2479–2494. ISSN 1057-7149.
  • [14] M. Harandi, M. Salzmann, and R. Hartley (2017) Joint dimensionality reduction and metric learning: a geometric take. ICML, pp. 1404–1413.
  • [15] J. Kim, T. Kim, S. Kim, and C. D. Yoo (2019) Edge-labeling graph neural network for few-shot learning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [16] G. Koch, R. Zemel, and R. Salakhutdinov (2015) Siamese neural networks for one-shot image recognition. In ICML Deep Learning Workshop, Vol. 2.
  • [17] P. Koniusz, A. Cherian, and F. Porikli (2016) Tensor representations via kernel linearization for action recognition from 3D skeletons. In ECCV, pp. 37–53.
  • [18] P. Koniusz, Y. Tas, and F. Porikli (2017) Domain adaptation by mixture of alignments of second- or higher-order scatter tensors. In CVPR, Vol. 2.
  • [19] P. Koniusz, Y. Tas, H. Zhang, M. Harandi, F. Porikli, and R. Zhang (2018) Museum exhibit identification challenge for the supervised domain adaptation and beyond. ECCV, pp. 788–804.
  • [20] P. Koniusz, F. Yan, P. Gosselin, and K. Mikolajczyk (2017) Higher-order occurrence pooling for bags-of-words: visual concept detection. PAMI 39 (2), pp. 313–326.
  • [21] P. Koniusz, H. Zhang, and F. Porikli (2018) A deeper look at power normalizations. In CVPR, pp. 5774–5783.
  • [22] M. Köstinger, M. Hirzer, P. Wohlhart, P. M. Roth, and H. Bischof (2012) Large scale metric learning from equivalence constraints. CVPR, pp. 2288–2295. ISBN 978-1-4673-1226-4.
  • [23] B. M. Lake, R. Salakhutdinov, J. Gross, and J. B. Tenenbaum (2011) One shot learning of simple visual concepts. CogSci.
  • [24] H. Larochelle, D. Erhan, and Y. Bengio (2008) Zero-data learning of new tasks. AAAI 1 (2), pp. 3.
  • [25] K. Lee, S. Maji, A. Ravichandran, and S. Soatto (2019) Meta-learning with differentiable convex optimization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 10657–10665.
  • [26] F. F. Li, R. VanRullen, C. Koch, and P. Perona (2002) Rapid natural scene categorization in the near absence of attention. Proceedings of the National Academy of Sciences 99 (14), pp. 9596–9601. ISSN 0027-8424.
  • [27] E. G. Miller, N. E. Matsakis, and P. A. Viola (2000) Learning from one example through shared densities on transforms. CVPR 1, pp. 464–471. ISSN 1063-6919.
  • [28] M-E. Nilsback and A. Zisserman (2008) Automated flower classification over a large number of classes. In Proceedings of the Indian Conference on Computer Vision, Graphics and Image Processing.
  • [29] B. Oreshkin, P. R. López, and A. Lacoste (2018) TADAM: task dependent adaptive metric for improved few-shot learning. In Advances in Neural Information Processing Systems, pp. 721–731.
  • [30] F. Porikli and O. Tuzel (2006) Covariance tracker. CVPR.
  • [31] M. Ren, E. Triantafillou, S. Ravi, J. Snell, K. Swersky, J. B. Tenenbaum, H. Larochelle, and R. S. Zemel (2018) Meta-learning for semi-supervised few-shot classification. In Proceedings of 6th International Conference on Learning Representations (ICLR).
  • [32] A. A. Rusu, D. Rao, J. Sygnowski, O. Vinyals, R. Pascanu, S. Osindero, and R. Hadsell (2018) Meta-learning with latent embedding optimization. arXiv preprint arXiv:1807.05960.
  • [33] J. Snell, K. Swersky, and R. Zemel (2017) Prototypical networks for few-shot learning. In NIPS, pp. 4077–4087.
  • [34] F. Sung, Y. Yang, L. Zhang, T. Xiang, P. H. Torr, and T. M. Hospedales (2017) Learning to compare: relation network for few-shot learning. arXiv preprint arXiv:1711.06025.
  • [35] O. Tuzel, F. Porikli, and P. Meer (2006) Region covariance: A fast descriptor for detection and classification. ECCV.
  • [36] O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra, et al. (2016) Matching networks for one shot learning. In NIPS, pp. 3630–3638.
  • [37] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie (2011) The Caltech-UCSD Birds-200-2011 Dataset. Technical Report CNS-TR-2011-001, California Institute of Technology.
  • [38] L. Wang, P. Koniusz, and D. Q. Huynh (2019) Hallucinating IDT descriptors and I3D optical flow features for action recognition with CNNs. In The IEEE International Conference on Computer Vision (ICCV).
  • [39] K. Q. Weinberger, J. Blitzer, and L. K. Saul (2006) Distance metric learning for large margin nearest neighbor classification. NIPS, pp. 1473–1480.
  • [40] D. Wertheimer and B. Hariharan (2019) Few-shot learning with localization in realistic settings. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6558–6567.
  • [41] R. S. Woodworth and E. L. Thorndike (1901) The influence of improvement in one mental function upon the efficiency of other functions. Psychological Review (I) 8 (3), pp. 247–261.
  • [42] H. Zhang and P. Koniusz (2019) Power normalizing second-order similarity network for few-shot learning. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1185–1193.
  • [43] H. Zhang, J. Zhang, and P. Koniusz (2019) Few-shot learning via saliency-guided hallucination of samples. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2770–2779.