All you need is a good representation: A multi-level and classifier-centric representation for few-shot learning

11/28/2019 ∙ by Shaoli Huang, et al. ∙ The University of Sydney

The main problems of few-shot learning are how to learn a generalized representation and how to construct discriminative classifiers from few-shot samples. We tackle both issues by learning a multi-level representation with a classifier-centric constraint. We first build the multi-level representation by combining three different levels of information: local, global, and higher-level. The resulting representation can characterize new concepts from different aspects and is more universal. To overcome the difficulty of generating classifiers from only a few sample features, we also propose a classifier-centric loss for learning the representation at each level, which forces samples to be centered on their respective classifier weights in the feature space. Therefore, the multi-level representation learned with the classifier-centric constraint not only enhances generalization ability but also can be used to construct discriminative classifiers from a small number of samples. Experiments show that our proposed method, without training or fine-tuning on novel examples, outperforms the current state-of-the-art methods on two low-shot learning datasets. We further show that our approach achieves a significant improvement over the baseline method in cross-task validation, and demonstrate its superiority in alleviating the domain shift problem.







1 Introduction

Figure 2: Our approach performs the few-shot learning task in two steps. The first step is representation learning, where we construct local, global, and higher-level feature extractors. After training the network with the classifier-centric loss, we use the concatenated features of novel examples as classifier weights to perform the few-shot learning task.

Although deep learning models have achieved remarkable success in visual recognition tasks [28], even exceeding human-level performance [9], they are generally very data hungry. This means that when a machine vision system is required to recognize a new concept, we need to retrain the recognition models with a considerable amount of new labeled data. In contrast, humans can quickly learn new concepts from just a few examples. For instance, a child can recognize an animal immediately after seeing it once or twice. To equip machine learning algorithms with such a quick learning ability, there has been a recent resurgence of interest in the research problem of few-shot learning [6, 8, 14, 25, 19, 24, 13].

Learning a generalized representation is an indispensable step in few-shot learning. Recent work [22, 14, 7, 24] on this problem mainly extracts features from the last feature layer of a deep ConvNet model to represent a novel sample. However, the ConvNet model is generally trained on data from the base classes, so the feature representation of its last layer is highly correlated with those classes. When there is a large difference between the novel categories and the base categories, the representation ability for novel samples degrades further. This view is supported by recent findings on the transferability of ConvNet features. Yosinski et al. [26] suggest that higher-layer activations, being more specialized to the source task, are less transferable to target tasks. Azizpour et al. [2] show experimentally that feature transferability is highly correlated with the distance of the target task from the source task of the trained ConvNet. In addition, such a representation often exhibits only the global visual patterns of a novel example and is insufficient to characterize the corresponding conceptual meaning.

For example, as shown in the first row of Fig. 1, given a novel sample, it is difficult to determine whether other samples belong to the same class by relying only on global features. Some samples require knowledge from local visual features, or even a higher-level view, to be matched to the sample. Thus, we argue that, given a novel example, a multi-level representation can better describe the corresponding concept. We therefore propose to form this multi-level representation by combining the global feature with local features from earlier layers and a higher-level feature from the softmax output. Earlier layers have been demonstrated to be more effective for transfer learning when the target task is further from the source task [21], while the softmax output encodes rich information about the similarity structure over the data [10].

Building a discriminative classifier for few-shot learning is more challenging than representation learning, since it relies heavily on a limited number of novel samples. Existing approaches to this problem can be divided into two groups: learning-based [16, 6, 22] and feature-based [14, 7]. The former normally trains a model that learns the classifier weights of novel categories from scratch, while the latter directly uses the features of novel samples to approximate the classifier weights. In this work, we focus on the feature-based way of producing classifier weights, because it is simpler, more flexible, and more straightforward. The most recent works in this group show that, by training the ConvNet model with a cosine softmax loss, the resulting feature embeddings can be directly used as classifier weights and yield promising performance on several few-shot benchmark datasets. However, similar to the ordinary softmax loss, features and classifiers learned with the cosine softmax loss are not well constrained. As illustrated in the second row of Fig. 1, we consider a possible case in which the cosine softmax loss converges, but using the learned embeddings of some samples as classifier weights may fail to produce a fine decision boundary. To tackle this issue, we introduce a classifier-centric constraint in the feature representation learning stage, pushing the sample features to be centered on their respective classifier weights. In other words, since the classifier learned by a deep ConvNet model is always discriminative, if we align the feature centers with the classifier weights in the representation learning stage, the resulting representation can be used to generate a more discriminative classifier from few-shot samples.

Overall, we propose a multi-level and classifier-centric representation to tackle the problem of few-shot learning. The representation captures local-level cues, global-level patterns, and the similarity structure over the base-class data, thereby presenting more universality to better characterize novel concepts. Also, with this representation, few-shot samples can be used to approximate the classifier weights of novel concepts. Finally, we extensively evaluate our approach on both the challenging ImageNet low-shot classification benchmark and CUB-200-2011. Experiments show that our proposed method significantly outperforms state-of-the-art methods when very few training examples are available. In the following sections, we review related work in Section 2, detail our few-shot representation learning methodology in Section 3, present experimental results in Section 4, and conclude in Section 5.

2 Related work

2.1 Few-shot Learning

Recently proposed approaches to the few-shot learning problem can be divided into meta-learning based [16, 6, 8] and metric-learning based approaches [11, 22, 14, 7].

Meta-learning based methods tackle the few-shot learning problem by training a meta-learner that helps a learner effectively learn a new task from very little training data [12, 15, 17, 6, 8]. Most of these methods are designed around standard practices for training deep models on limited data, such as finding a good weight initialization [6] or performing data augmentation [8] to prevent overfitting. For instance, Finn et al. [6] propose to learn a set of parameters to initialize the learner model so that it can be quickly adapted to a new task with only a few gradient descent steps; [8] deal with the data deficiency in a more straightforward way, in which a generator is trained on meta-training data and used to augment the features of novel examples for training the learner. Another line of work addresses the problem in a "learning-to-optimize" fashion [17, 16]. For example, Ravi et al. [16] train an LSTM-based meta-learner as an optimizer to update the learner and store the previous update records in external memory. Though this group of methods achieves promising results, they either require designing complex inference mechanisms [5] or further training a classifier for novel concepts [16, 6]. In contrast, by imposing a classifier-centric constraint in the representation learning stage, we can construct a stable decision boundary directly from the feature vectors of a few novel examples, without the need to train classifiers for novel concepts.

Metric-learning based methods mainly learn a feature space in which images are easy to classify using a distance-based classifier such as cosine similarity or nearest neighbor. To do so, Koch et al. [11] train a Siamese network that learns a metric space to perform comparisons between images. Vinyals et al. [22] propose Matching Networks to learn a contextual embedding, with which the label of a test example can be predicted by looking for its nearest neighbors in the support set. Prototypical Networks [19] determine the class label of a test example by measuring its distance to the class means of the support set. Since the distance functions of these two works are predefined, [25] further introduce a learnable distance metric for comparing query and support samples.

The methods most related to ours are [14, 7, 3]. These approaches learn a feature representation with a cosine softmax loss such that the resulting feature vectors of novel examples can be used to construct the classifier weights. By comparison, we further introduce a classifier-centric constraint that explicitly enforces the features of samples to be close to their class's classifier weights during representation learning. By doing so, the representation learned by our approach can be used to construct a more stable decision boundary from a few random examples. In addition, these methods learn only a single level of representation, resulting in a limited ability to represent novel categories, while ours constructs a multi-level representation that captures multiple levels of knowledge, thereby presenting a stronger ability to characterize novel concepts.

2.2 Analyzing the transferability of ConvNets.

Deep learning models are quite data-hungry, but transfer learning has proven highly effective in avoiding over-fitting when training larger models on smaller datasets [4, 27, 18]. These findings have raised interest in studying the transferability of deep model features in recent years. Yosinski et al. [26] show experimentally how transferable each layer is by quantifying the generality versus specificity of its features in a deep ConvNet, and suggest that higher-layer activations, being more specialized to the source task, are less transferable to target tasks. Agrawal et al. [1] investigate several aspects that impact the performance of ConvNet models for object recognition. Azizpour et al. [2] identify several factors that affect the transferability of ConvNet features and demonstrate that optimizing these factors aids the transfer task. However, these works mainly explore the transferability and generalization ability of ConvNet features on target datasets that contain far more training samples than the few-shot setting. In this work, we investigate the capacities of the intermediate layers, the last feature layer, and the softmax logits for performing few-shot learning tasks.

3 Methodology

As shown in Fig. 2, our proposed method addresses the few-shot learning problem using a two-stage pipeline: representation learning and constructing novel classifiers. In the first stage, we construct three feature extractors from different layers of a typical ConvNet model, named the local, global, and higher-level feature extractors, and learn these representations jointly on base-class data with a cosine softmax loss under a classifier-centric constraint. Once the representations are learned, we extract the three levels of features from novel examples and concatenate them to construct classifier weights.

3.1 Representation Learning

3.1.1 Classifier-Centric Loss

Cosine Softmax Loss. The key idea of metric-learning based methods [11, 19] is to learn a mapping function f that maps samples into an embedding space in which similar samples are close while dissimilar ones are far apart. An embedded point z = f(x) can then be classified by a softmax classifier, which usually refers to the last fully connected layer followed by a softmax layer. Such a mapping and the classifier weights can be learned by minimizing the cross-entropy loss:

\mathcal{L}_{softmax} = -\log \frac{\exp(w_y^\top z)}{\sum_j \exp(w_j^\top z)}, \quad (1)

where w_j is the j-th column of the weight matrix W of the softmax classifier.

The most recent works [14, 7] demonstrate that training a ConvNet model with a cosine similarity classifier leads the learned representation to generalize better to novel categories. Here we can modify the softmax classifier into a cosine similarity classifier by applying \ell_2-normalization to both the embedding vector and the weight vectors:

\cos\theta_j = \hat{w}_j^\top \hat{z}, \quad \hat{z} = z / \|z\|_2, \quad \hat{w}_j = w_j / \|w_j\|_2. \quad (2)

After normalization, since \cos\theta_j \in [-1, 1], the cosine similarity classifier fails to produce a one-hot categorical distribution, so a trainable scale factor s is usually used along with the cosine softmax classifier. In this case, the cosine softmax loss can be written as:

\mathcal{L}_{cos} = -\log \frac{\exp(s \cos\theta_y)}{\sum_j \exp(s \cos\theta_j)}. \quad (3)
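As a concrete illustration, the cosine softmax loss can be sketched in a few lines of NumPy (the scale factor s is fixed here rather than trained, and the function name is ours):

```python
import numpy as np

def cosine_softmax_loss(z, W, y, s=10.0):
    """Cosine softmax loss for a single embedded point.

    z: (d,) embedding, W: (d, K) weight matrix with one column per class,
    y: ground-truth class index, s: scale factor (trainable in practice,
    fixed here for illustration).
    """
    z_hat = z / np.linalg.norm(z)                         # l2-normalize the embedding
    W_hat = W / np.linalg.norm(W, axis=0, keepdims=True)  # l2-normalize each weight column
    logits = s * (W_hat.T @ z_hat)                        # scaled cosine similarities
    logits = logits - logits.max()                        # numerical stability
    p = np.exp(logits) / np.exp(logits).sum()
    return -np.log(p[y])

# An embedding aligned with its own class weight incurs a low loss.
rng = np.random.default_rng(0)
W = rng.normal(size=(16, 5))
z = W[:, 2] + 0.01 * rng.normal(size=16)
```

Because both vectors are normalized, the classification depends only on angles; the scale s controls how peaked the resulting distribution can be.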
Classifier-Centric Constraint. [14, 7] assume that samples of the same class are concentrated in the feature space learned with the cosine softmax loss, so the feature embeddings of a few random samples can be used to approximate the classifier weights. However, this assumption does not strictly hold in some cases; for example, data with large intra-class variance and small inter-class variance tend to be scattered in the feature space. To ensure that one or a few embedded points of each category can construct a stable decision boundary, we explicitly constrain a feature point \hat{z}_i to be near its classifier weight \hat{w}_{y_i} after the classifier is learned. The constraint loss is given by

\mathcal{L}_{cc} = \sum_i \|\hat{z}_i - \hat{w}_{y_i}\|_2^2, \quad (4)

and the total loss is given by

\mathcal{L}(X, Y) = \mathcal{L}_{cos} + \lambda \mathcal{L}_{cc}, \quad (5)

where X denotes the training images, Y the labels, \lambda is a weighting parameter, and s is the scale factor from Eq. 3. Since it is hard to train the deep ConvNet model by directly optimizing the classifier-centric loss, in practice we instead adopt a pre-training strategy: we first train the model with the cosine softmax loss and then fine-tune with the loss \mathcal{L}. Note that, in the following subsections, learning each level of representation is based on this strategy.
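To make the constraint concrete, here is a minimal NumPy sketch of the classifier-centric term (the exact form of the penalty is our assumption; only its intent, pulling normalized features toward their class weight, comes from the text above):

```python
import numpy as np

def classifier_centric_loss(Z, W, y):
    """Sum of squared distances between each l2-normalized feature and the
    l2-normalized classifier weight of its ground-truth class.
    Z: (n, d) embeddings, W: (d, K) weight matrix, y: (n,) integer labels."""
    Z_hat = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    W_hat = W / np.linalg.norm(W, axis=0, keepdims=True)
    centers = W_hat[:, y].T   # (n, d): the weight each sample should center on
    return float(np.sum((Z_hat - centers) ** 2))

# Features that coincide with their class weights incur zero penalty.
W = np.eye(4)
Z_perfect = np.eye(4)
labels = np.arange(4)
```

In training, this term is weighted and added to the cosine softmax loss after the pre-training stage described above.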

3.1.2 Local and global feature extractor.

Though earlier layers in a ConvNet model show less discriminative ability than later layers, they learn to capture sub-part features that are more general and less specific to the training task. This information is essential for characterizing novel concepts, especially when there is a domain shift between the base classes and the novel classes. Therefore, we build a local feature extractor f_l that extracts features from earlier layers. To do this, we add a global max pooling layer on top of each earlier convolutional layer, concatenate the pooled features from these layers, and feed them to a fully connected layer. The local representation is then learned with the loss \mathcal{L}_l obtained by substituting f_l and its classifier weight matrix W_l into Eq. 5. We also extract features from the penultimate layer, which is usually a global pooling layer. This layer provides information about the entire image, so we name the extractor learned from it the global feature extractor f_g. Similarly, the global representation can be learned with the loss \mathcal{L}_g.
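The pooling-and-concatenation step above can be sketched as follows (the layer shapes and the projection size are illustrative assumptions, not the paper's configuration):

```python
import numpy as np

def global_max_pool(fmap):
    # fmap: (C, H, W) activation map -> (C,) vector of per-channel maxima
    return fmap.reshape(fmap.shape[0], -1).max(axis=1)

rng = np.random.default_rng(0)
# Hypothetical activations from two earlier convolutional layers.
conv2 = rng.random((64, 56, 56))
conv3 = rng.random((128, 28, 28))

# Pool each layer globally, then concatenate the pooled vectors.
pooled = np.concatenate([global_max_pool(conv2), global_max_pool(conv3)])

# A random projection stands in for the fully connected layer producing f_l(x).
W_fc = rng.normal(size=(256, pooled.shape[0]))
local_feature = W_fc @ pooled
```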

3.1.3 Higher-level feature extractor.

ConvNet models typically use a softmax function to produce class probabilities. Given an input logit z_i, the softmax function converts the logits into class probabilities by comparing each logit with the others:

p_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}. \quad (6)

Here, T is called the temperature parameter. A higher temperature is normally used in knowledge distillation [10], because it leads the softmax function to produce a softer probability distribution over the classes, and this distribution captures a rich similarity structure over the data. We therefore borrow the idea of setting a higher temperature for the softmax function to learn a higher-level feature embedding f_h that encodes rich information about how an input relates to the base classes. In our framework, we copy the output of the global feature extractor and scale it by a constant (the inverse temperature 1/T). This scaled vector is fed to a softmax layer followed by a fully connected layer to extract the feature f_h(x). Thus, we learn the higher-level representation with the loss \mathcal{L}_h.
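The effect of the temperature can be checked directly (a small sketch; the function name is ours):

```python
import numpy as np

def softmax(logits, T=1.0):
    z = np.asarray(logits, dtype=float) / T  # divide logits by the temperature
    z = z - z.max()                          # numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = [6.0, 2.0, 1.0]
sharp = softmax(logits, T=1.0)   # close to one-hot: dominated by the top class
soft = softmax(logits, T=10.0)   # softer: secondary classes keep visible mass
```

The softer distribution is what the higher-level extractor consumes, since it retains the relative similarities between the input and every base class.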

Overall, the multi-level and classifier-centric representation can be learned by minimizing the following loss function:

\mathcal{L}_{total} = \mathcal{L}_l + \mathcal{L}_g + \mathcal{L}_h. \quad (7)
3.2 Constructing classifier weights

In this section, we show how the representation learned in the previous section can be used to construct classifier weights from few examples. We first assume that we have learned the three representations f_l, f_g, and f_h, and their base classifier weight matrices W_l, W_g, and W_h. Then, as illustrated in Fig. 2, the multi-level representation is obtained by concatenating the three:

f(x) = [f_l(x); f_g(x); f_h(x)], \quad (8)

and the overall weight matrix of the base classifiers is obtained by concatenating their classifier weights accordingly:

W = [W_l; W_g; W_h]. \quad (9)
Given an input training sample x from a novel class c, we extract the feature vector f(x) and directly use it as the classifier weight w_c for the novel class c. We then extend the base classifiers by inserting the weight column w_c into the classifier weight matrix W, so that the whole system is able to recognize the novel concept c.

If more than one training example is available for the novel class c, we use the average embedding in the same way as [14]. Given n training examples x_1, \dots, x_n, we first compute the average embedding

\bar{z} = \frac{1}{n} \sum_{i=1}^{n} f(x_i), \quad (10)

and finally obtain the weight vector by \ell_2-normalization:

w_c = \bar{z} / \|\bar{z}\|_2. \quad (11)
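The whole imprinting procedure, averaging the novel examples' features, normalizing, and appending a new column, can be sketched as follows (function names are ours):

```python
import numpy as np

def imprint_novel_weight(features):
    """Average the multi-level features of the novel examples and
    l2-normalize the mean to obtain the new classifier weight."""
    z_bar = features.mean(axis=0)
    return z_bar / np.linalg.norm(z_bar)

def extend_classifier(W_base, w_novel):
    # Append the imprinted weight as a new column of the base weight matrix.
    return np.concatenate([W_base, w_novel[:, None]], axis=1)

rng = np.random.default_rng(0)
feats = rng.normal(size=(5, 32))     # f(x_i) for n = 5 novel examples
W_base = rng.normal(size=(32, 100))  # weights for 100 base classes
W_ext = extend_classifier(W_base, imprint_novel_weight(feats))
```

No gradient step is involved: recognizing the novel class costs only a feature pass and one matrix concatenation.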
4 Experiments

4.1 Datasets


Method n=1 2 5 10 20 Classifier
Baseline [14, 7] 53.96 62.88 69.55 71.56 73.42 81.80
Baseline + Classifier-Centric 69.93 74.94 78.30 78.99 79.68 81.71
Table 1: Classification accuracy on the CUB validation set when n random sample features per class are used as classifier weights, in the two feature spaces.


Novel / Novel Novel / All All
Method n=1 2 5 10 20 n=1 2 5 10 20 n=1 2 5 10 20


Pro. Nets [19] (from [1])
39.4 54.4 66.3 71.2 73.9 - - - - - 49.5 61.0 69.7 72.9 74.6

Log. Reg. (from [24])
38.4 51.1 64.8 71.6 76.6 - - - - - 40.8 49.9 64.2 71.9 76.9

Log. Reg w/G. (from [24])
40.7 50.8 62.0 69.3 76.5 - - - - - 52.2 59.4 67.6 72.8 76.9

Pro. Mat. Nets [24]
43.3 55.7 68.4 74.0 77.0 - - - - - 55.8 63.1 71.1 75.0 77.1

Pro. Mat. Nets w/G [24]
45.8 57.8 69.0 74.3 77.4 - - - - - 57.6 64.7 71.9 75.2 77.5

SGM w/G. [8]
- - - - - 32.8 46.4 61.7 69.7 73.8 54.3 62.1 71.3 75.8 78.1

Batch SGM [8]
- - - - - 23.0 42.4 61.9 69.9 74.5 49.3 60.5 71.4 75.8 78.5

Mat. Nets [22](from [8, 24])
43.6 54.0 66.0 72.5 76.9 41.3 51.3 62.1 67.8 71.8 54.4 61.0 69.0 73.7 76.5

Wei. Imprint * [14]
44.05 55.42 68.06 73.96 77.21 38.70 51.36 65.89 72.60 76.21 56.73 63.66 71.04 74.05 75.47

Cos. Avg. Wei. Gen. [7]
45.23 56.90 68.68 74.36 77.69 39.33 50.27 63.16 69.56 73.47 54.65 64.69 72.35 76.18 78.46

Cos. Att. Wei. Gen. [7]
46.02 57.51 69.16 74.84 78.81 40.79 51.51 63.77 70.07 74.02 58.16 65.21 72.72 76.65 78.74

Ours
48.22 58.77 69.71 74.45 76.91 44.06 55.83 68.15 73.36 76.07 58.96 65.18 71.28 73.63 74.78

Ours with Aug
49.09 59.66 70.26 74.72 77.04 45.56 57.12 68.85 73.73 76.24 59.37 65.48 71.36 73.63 74.72

Mat. Nets [22] (from [24]) 53.5 63.5 72.7 77.4 81.2 - - - - - 64.9 71.0 77.0 80.2 82.7

Pro. Nets [19]
49.6 64.0 74.4 78.1 80.0 - - - - - 61.4 71.4 78.0 80.0 81.1

Pro. Mat. Nets w/G [24]
54.7 66.8 77.4 81.4 83.8 - - - - - 65.7 73.5 80.2 82.8 84.5
SGM w/G. (from [24]) - - - - - 45.1 58.8 72.7 79.1 82.6 63.6 71.5 80.0 83.3 85.2

Ours
57.12 68.28 77.77 81.80 83.72 53.48 65.05 76.59 80.95 83.07 67.49 73.36 79.87 81.98 82.95

Ours with Aug
57.97 69.08 78.19 81.99 83.80 54.82 66.93 77.12 81.22 83.16 68.01 74.72 79.98 81.99 82.88


Table 2: Comparison with state-of-the-art methods on the Few-shot-ImageNet dataset. Best results are in bold. * indicates results from our own implementation. "Aug" means we take 5 random crops of each training example and use the average feature as the weight of the novel class.
Figure 3: Some successful exemplars using our proposed method. The first column shows a single training image of a novel class; all images in the remaining three columns are correctly predicted using the proposed multi-level representation. The second column shows successful predictions that rely only on global-level features and are misclassified when using the local or higher-level representation alone, and likewise for the third and fourth columns with local and higher-level features, respectively.

In this section, we describe our experiments and compare our approach with existing methods on Few-shot-ImageNet [8] and Few-shot-CUB [14].

Few-shot-ImageNet was proposed by [8]. The ImageNet categories are divided into four subsets containing 193 base categories, 300 novel categories, 196 base categories, and 311 novel categories, respectively. The first two groups are used for validating hyper-parameters, and the remaining two groups for the final evaluation. Performance on this benchmark is measured by the accuracy of novel test examples over the full label space and by the accuracy over all test samples. Wang et al. [24] slightly change this metric, measuring the accuracy of novel test examples within the novel label space only. To compare fairly with all results reported on this benchmark, we report our results under both metrics. In our experiments, we randomly select the training images of the novel categories, repeat each experiment 100 times, and report the mean accuracies with confidence intervals.


Novel / Novel Novel / All All

n=1 2 5 10 20 n=1 2 5 10 20 n=1 2 5 10 20

Inception V1
Gen. + Cla. [8] (from [14]) - - - - - 18.56 19.07 20.00 20.27 20.88 45.42 46.56 47.79 47.88 48.22

Mat. Nets [22](from [14])
- - - - - 13.45 14.75 16.65 18.18 25.77 41.71 43.15 44.46 45.65 48.63
Imprinting [14] - - - - - 21.26 28.69 39.52 45.77 49.32 44.75 48.21 52.95 55.99 57.47
Imprinting +Aug [14] - - - - - 21.40 30.03 39.35 46.35 49.80 44.60 48.48 52.78 56.51 57.84
Ours 32.35 39.78 49.47 54.67 57.37 30.72 37.65 48.17 53.56 56.45 49.80 53.41 57.87 60.46 61.61
Ours + Aug 33.56 40.82 50.28 54.67 57.53 30.87 39.01 49.17 53.66 56.61 49.96 53.73 58.18 60.30 61.60

Imprinting + Aug [14] (ResNet-50*)
32.15 40.48 52.41 57.93 61.72 26.24 35.79 49.31 55.31 59.38 52.43 56.83 62.89 65.53 67.27
Ours 35.91 44.91 56.95 62.48 66.01 33.54 43.47 56.21 61.96 65.61 55.45 59.58 64.94 67.32 68.78
Ours + Aug 36.96 45.53 57.43 63.03 66.35 34.91 44.21 56.81 62.52 65.96 55.60 59.66 65.02 67.46 68.89

Table 3: Comparison with state-of-the-art methods on the Few-shot-CUB dataset. Best results are in bold. * indicates results from our own implementation. "Aug" means we take 5 random crops of each training example and use the average feature as the weight of the novel class.

Few-shot-CUB contains 200 fine-grained bird categories with 11,788 images [23]. Qi et al. [14] construct a low-shot setup on this dataset by using the first 100 classes as base classes and the remaining 100 classes as novel classes. Since each category contains only about 30 images, we repeat the experiments 20 times and report the average top-1 accuracy.

4.2 Network architecture and training details

Network architecture. To compare fairly with previous methods on the ImageNet-based few-shot benchmarks, we use the ResNet-10 [9] architecture in our learning framework. We also report some results using the deeper ResNet-50 [9]. For experiments on CUB-200-2011, Qi et al. [14] report results based on InceptionV1 [20], so we first implement our method on the InceptionV1 architecture for a fair comparison. We also re-implement the method of Qi et al. [14] so that it can be compared to ours on the ResNet-50 architecture.

Training details. For all experiments on the ImageNet-based few-shot benchmarks, we train our model from scratch for 90 epochs on the base classes. The learning rate starts from 0.1 and is divided by 10 every 30 epochs, with a fixed weight decay of 0.0001. We then fine-tune the model further with the classifier-centric constraint using a small learning rate of 0.0001. For the CUB experiments, all the pre-trained models we use are from the official PyTorch model zoo. During training, the initial learning rate is 0.001 and is decayed by a factor of 10 every 30 epochs.

4.3 Results

Effectiveness of the classifier-centric constraint. To verify the effectiveness of the classifier-centric constraint, we conduct the following experiment. First, we train two ConvNet models on the base-class data, with and without the classifier-centric constraint, to learn two feature spaces. Then we randomly draw samples from each base class to construct two classifiers and use them to classify the test set. Finally, by evaluating their classification performance, we can tell in which feature space the samples construct the better decision boundary. The results are shown in Tab. 1. We observe that the feature space learned with only the cosine softmax loss achieves poor accuracy, indicating that the sample points in this space may be scattered and not close to the classifier weights. With the classifier-centric constraint, the accuracy improves significantly. This demonstrates that the feature space learned with the classifier-centric constraint is more suitable for building classifiers from samples.


Novel / Novel Novel / All All

n=1 2 5 10 20 n=1 2 5 10 20 n=1 2 5 10 20

Baseline (G). [14, 7] 51.56 63.67 74.78 79.68 82.45 45.26 58.53 71.80 77.64 80.79 63.26 70.98 78.35 81.42 83.01
L 51.59 63.80 75.57 80.60 83.21 46.51 60.31 73.88 79.57 82.45 62.57 70.24 77.19 79.80 81.02
H 48.94 58.64 69.23 73.32 75.65 48.08 59.05 68.87 73.03 75.39 58.84 64.92 70.18 72.35 73.59

G+H 52.49 64.08 74.21 78.57 80.57 49.83 62.07 73.17 77.85 80.38 63.95 70.79 76.71 79.07 80.34
G+L 55.50 67.51 78.26 82.75 85.00 49.28 63.16 76.17 81.40 84.01 65.70 73.60 80.49 83.00 84.14

L+G+H 55.78 67.43 77.63 81.84 83.97 52.19 65.05 76.52 81.12 83.43 66.25 73.36 79.24 81.43 82.48
L+G+H+CC 57.12 68.28 77.77 81.80 83.72 53.48 65.05 76.59 80.95 83.07 67.49 73.36 79.87 81.98 82.95

Table 4: Ablation study on the ImageNet-based few-shot benchmark. G: global, L: local, H: higher-level, CC: classifier-centric. Best results are in bold.


Novel / Novel
Novel classes from ImageNet Novel classes from CUB2011

n=1 2 5 10 20 n=1 2 5 10 20

Global 51.56 63.67 74.78 79.68 82.45 30.55 40.76 53.68 60.79 65.54
Local 51.59 63.80 75.57 80.60 83.21 35.99 48.40 62.51 70.26 74.92
Higher-level 48.94 58.64 69.23 73.32 75.65 24.45 32.19 40.92 46.18 49.18
Multi-level 55.50 67.51 78.26 82.75 85.00 36.15 48.34 62.44 69.94 74.37

Table 5: The performance of using different levels of representation for few-shot learning on the same task (Generic object classification) and another different task (Fine-grained object classification). Top-5 accuracy of the novel categories in the novel label space (Novel/Novel) is reported. Best are bolded.

Few-shot classification accuracy. We then evaluate the few-shot classification accuracy of the proposed method on both the Few-shot-ImageNet and Few-shot-CUB datasets and compare with recent works. For Few-shot-ImageNet, we report the top-5 accuracy on the novel categories within the novel label space (Novel/Novel), on the novel categories within the full label space (Novel/All), and on all categories (All). The results are shown in Tab. 2. Our approach outperforms all other methods when there are only 1 or 2 training examples per novel class. As the number of training examples increases, our method does not show superior performance under the "Novel/Novel" and "All" settings. This is because we simply use the mean features of the novel training examples as the classifier weights without any further training on them, while other methods either train or fine-tune a learner. Despite this, our approach remains competitive in these settings. More interestingly, our method consistently outperforms the others under the "Novel/All" setting across the whole range of training set sizes. We also provide some prediction results in Fig. 3, which allow an intuitive analysis of the few-shot learning ability of the different representations. For example, the test images in the second column mostly contain patterns (e.g., objects or parts of objects) that are very similar to those in the training examples, while the similarities between the images in the last two columns and the training images tend to be subtle. For Few-shot-CUB, we report the top-1 accuracy under the same three settings. We compare with existing methods in Tab. 3 and show that our method significantly outperforms the results previously reported on this dataset.

Cross-task performance of few-shot learning. We also investigate the few-shot learning ability of the different levels of representation across tasks. To this end, we use a cross-task evaluation setting in which the novel classes come from a different task: all categories of the Caltech-UCSD Birds dataset [23], a fine-grained benchmark, serve as the novel concepts, while the base categories are drawn from a generic object recognition benchmark, the challenging ImageNet dataset. Here we use the same base categories as [8]. We measure top-5 accuracy on the Caltech-UCSD Birds test set and report the results in Tab. 5. Unsurprisingly, the representation learned from the predicted class distribution performs worst in the cross-task setting, because it is the most specific to the source task and fails to characterize the class-neighbourhood structure of a novel example from a new task. On the other hand, the proposed multi-level representation achieves the best overall performance. This indicates that our proposed method is less sensitive to domain shift than existing methods.

Ablation study. In Tab. 4, we provide an ablation study of our proposed method on the Few-shot-ImageNet benchmark. We evaluate the few-shot learning ability of features from the local level (L), global level (G), higher level (H), and several of their combinations. As the baseline, we use the cosine similarity classifier on the last feature layer only, as introduced by [14, 7]. We observe that integrating all three levels of features achieves the best performance when only 1 or 2 training examples of a novel class are provided.

5 Conclusion

In this work, we aim to learn a representation that has better generalization ability and can be used to construct a discriminative classifier from few examples. To achieve this goal, we first design three feature extractors based on a ConvNet model, which capture local, global, and higher-level information. We then introduce a classifier-centric constraint for learning each feature extractor; this constraint enforces the samples to be close to their classifier weights in the feature space. The resulting representation not only has a stronger ability to represent unseen concepts but can also be used to construct a discriminative classifier from few samples. Our experimental results demonstrate its effectiveness and also suggest that learning a multi-level representation is essential for the few-shot learning task, especially when there is a large domain difference between the base data and the novel data. Also, thanks to its simplicity and effectiveness, our proposed representation can serve as a baseline to advance the study of meta-learning.


  • [1] P. Agrawal, R. Girshick, and J. Malik (2014) Analyzing the performance of multilayer neural networks for object recognition. In ECCV, Cited by: §2.2, Table 2.
  • [2] H. Azizpour, A. Sharif Razavian, J. Sullivan, A. Maki, and S. Carlsson (2015) From generic to specific deep representations for visual recognition. In CVPR, Cited by: §1, §2.2.
  • [3] W. Chen, Y. Liu, Z. Kira, Y. F. Wang, and J. Huang (2019) A closer look at few-shot classification. In ICLR, Cited by: §2.1.
  • [4] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell (2014) Decaf: a deep convolutional activation feature for generic visual recognition. In ICML, Cited by: §2.2.
  • [5] L. Fei-Fei, R. Fergus, and P. Perona (2006) One-shot learning of object categories. IEEE transactions on pattern analysis and machine intelligence 28 (4), pp. 594–611. Cited by: §2.1.
  • [6] C. Finn, P. Abbeel, and S. Levine (2017) Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, Cited by: §1, §1, §2.1, §2.1.
  • [7] S. Gidaris and N. Komodakis (2018) Dynamic few-shot visual learning without forgetting. In CVPR, Cited by: §1, §1, §2.1, §2.1, §3.1.1, §3.1.1, §4.3, Table 1, Table 2, Table 4.
  • [8] B. Hariharan and R. B. Girshick (2017) Low-shot visual recognition by shrinking and hallucinating features.. In ICCV, Cited by: §1, §2.1, §2.1, §4.1, §4.1, §4.3, Table 2, Table 3.
  • [9] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778. Cited by: §1, §4.2.
  • [10] G. Hinton, O. Vinyals, and J. Dean (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531. Cited by: §1, §3.1.3.
  • [11] G. Koch, R. Zemel, and R. Salakhutdinov (2015) Siamese neural networks for one-shot image recognition. In ICML Deep Learning Workshop, Cited by: §2.1, §2.1, §3.1.1.
  • [12] Y. Liu, J. Lee, M. Park, S. Kim, and Y. Yang (2019) Transductive propagation network for few-shot learning. ICLR. Cited by: §2.1.
  • [13] B. Oreshkin, P. R. López, and A. Lacoste (2018) TADAM: task dependent adaptive metric for improved few-shot learning. In NIPS, Cited by: §1.
  • [14] H. Qi, M. Brown, and D. G. Lowe (2018) Low-shot learning with imprinted weights. In CVPR, Cited by: §1, §1, §1, §2.1, §2.1, §3.1.1, §3.1.1, §3.2, §4.1, §4.1, §4.2, §4.3, Table 1, Table 2, Table 3, Table 4.
  • [15] S. Qiao, C. Liu, W. Shen, and A. L. Yuille (2018) Few-shot image recognition by predicting parameters from activations. In CVPR, Cited by: §2.1.
  • [16] S. Ravi and H. Larochelle (2017) Optimization as a model for few-shot learning. In ICLR, Cited by: §1, §2.1, §2.1.
  • [17] A. A. Rusu, D. Rao, J. Sygnowski, O. Vinyals, R. Pascanu, S. Osindero, and R. Hadsell (2019) Meta-learning with latent embedding optimization. ICLR. Cited by: §2.1.
  • [18] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun (2014) Overfeat: integrated recognition, localization and detection using convolutional networks. In ICLR, Cited by: §2.2.
  • [19] J. Snell, K. Swersky, and R. Zemel (2017) Prototypical networks for few-shot learning. In NIPS, Cited by: §1, §2.1, §3.1.1, Table 2.
  • [20] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich (2015) Going deeper with convolutions. In CVPR, Cited by: §4.2.
  • [21] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf (2015) Web-scale training for face identification. In CVPR, Cited by: §1.
  • [22] O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra, et al. (2016) Matching networks for one shot learning. In NIPS, Cited by: §1, §1, §2.1, §2.1, Table 2, Table 3.
  • [23] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie (2011) The caltech-ucsd birds-200-2011 dataset. Cited by: §4.1, §4.3.
  • [24] Y. Wang, R. Girshick, M. Hebert, and B. Hariharan (2018) Low-shot learning from imaginary data. CVPR. Cited by: §1, §1, §4.1, Table 2.
  • [25] F. S. Y. Yang, L. Zhang, T. Xiang, P. H. Torr, and T. M. Hospedales (2018) Learning to compare: relation network for few-shot learning. In CVPR, Cited by: §1, §2.1.
  • [26] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson (2014) How transferable are features in deep neural networks?. In NIPS, Cited by: §1, §2.2.
  • [27] M. D. Zeiler and R. Fergus (2014) Visualizing and understanding convolutional networks. In ECCV, Cited by: §2.2.
  • [28] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva (2014) Learning deep features for scene recognition using places database. In NIPS, Cited by: §1.