Learning from Adversarial Features for Few-Shot Classification

03/25/2019 ∙ by Wei Shen, et al. ∙ FUJITSU

Many recent few-shot learning methods concentrate on designing novel model architectures. In this paper, we instead show that with a simple backbone convolutional network we can even surpass state-of-the-art classification accuracy. The essential part that contributes to this superior performance is an adversarial feature learning strategy that improves the generalization capability of our model. In this work, adversarial features are those features that make the classifier uncertain about its prediction. In order to generate adversarial features, we first locate adversarial regions based on the derivative of the entropy with respect to an averaging mask. Then we use the adversarial region attention to aggregate the feature maps to obtain the adversarial features. In this way, we can explore and exploit the entire spatial area of the feature maps to mine more diverse discriminative knowledge. We perform extensive model evaluations and analyses on the miniImageNet and tieredImageNet datasets, demonstrating the effectiveness of the proposed method.




1 Introduction

Few-shot classification aims at classifying query images into classes each of which has a few labelled support images. The number of classes is called way and the number of support images is called shot. For example, if a task contains 5 classes and each class has one labelled image, it is a 5-way 1-shot task. If each class has five labelled images, it is a 5-way 5-shot task. The main challenge of few-shot classification is that the classes in the test set have no overlap with those in the training set. Thus, it requires models to be able to generalize to novel classes rather than simply over-fit the training data.
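To make the way/shot terminology concrete, the following is a minimal sketch of how one N-way K-shot task could be sampled from a pool of novel classes. The function name `sample_episode`, the `images_by_class` dictionary and the choice of 15 query images per class are illustrative assumptions, not details taken from this paper.

```python
import random

def sample_episode(images_by_class, n_way=5, k_shot=1, n_query=15, seed=None):
    """Sample one N-way K-shot task from a dict {class_name: [image ids]}."""
    rng = random.Random(seed)
    # Pick the N classes ("way") that define this task.
    classes = rng.sample(sorted(images_by_class), n_way)
    support, query = [], []
    for label, cls in enumerate(classes):
        # Draw disjoint support ("shot") and query images for this class.
        picked = rng.sample(images_by_class[cls], k_shot + n_query)
        support += [(img, label) for img in picked[:k_shot]]
        query += [(img, label) for img in picked[k_shot:]]
    return support, query

# Toy pool of 20 novel classes with 600 image ids each (hypothetical ids).
toy = {f"class_{c}": [f"class_{c}/img_{i}" for i in range(600)] for c in range(20)}
support, query = sample_episode(toy, n_way=5, k_shot=1, n_query=15, seed=0)
```

With `k_shot=1` the support set holds one labelled image per class (a 5-way 1-shot task); setting `k_shot=5` gives the 5-way 5-shot setting.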

Recent years have witnessed rapid advances in few-shot learning. Meta-learning based methods address the challenge by learning meta knowledge from a number of different tasks and adapting the knowledge to novel tasks [30, 4, 27]. The tasks in the training phase usually mimic the settings that will be used in the test phase to minimize the gap between training settings and test settings. The assumption is that the knowledge shared among those training tasks can be transferred to other novel tasks. This shared knowledge can be a metric space [27], a good initialization [4] or task representations [16]. Although those methods have achieved improved results, two counter-intuitive points need to be noted. (a) As pointed out in [27], if the test tasks are 5-shot tasks but the model is trained on 1-shot tasks, the performance will be much lower than that of a model trained on 5-shot tasks, and vice versa. This means that we have to train one model for 1-shot tasks and another for 5-shot tasks in order to achieve good performance on both. (b) If the number of ways of the training tasks is the same as that of the test tasks, we will not achieve the highest classification accuracy either. Instead, we have to increase the number of ways of the training tasks (to 20-way) to get improved classification accuracy on the test tasks (5-way).

Another line of current work dynamically generates the classifier’s weights for novel classes. Given a pretrained model, one can compare the feature of a novel class to those of the known classes and infer the weight for the novel class using an attention mechanism [5]. However, the training phase usually has to be split into two stages, one for feature extraction and the other for weight generation. One can also train a model to generate different features from the distribution of a given class on the training data and perform such augmentation for novel classes with few samples [25]. Thus, in the test phase the classifier can be trained with more data. Nevertheless, generating those samples and training a new classifier is time consuming.

Although the above-mentioned methods have achieved impressive results, few of them pay much attention to the quality of the basic feature extractor. Training neural networks in an end-to-end way sometimes results in models over-fitting the training data. For example, if a classifier is trained on a dog dataset, the classifier may pay much attention to the head of the dog and less to the body, even if the body also contains discriminative information [36]. Another extreme example is described as blind spots in [11], where a predictive model trained on images of black dogs and white cats will incorrectly label a white dog (test image) as a cat with high confidence. The blind spot example indicates that the model may merely learn some conspicuous discriminative features and fail to capture the intrinsic characteristics of the target object. This problem is more pronounced in few-shot learning. Since test classes are different from training classes, if the model over-fits the training data, its performance on novel classes will be poor.

In this work, we show that models trained on adversarially perturbed features can generalize better than those trained on clean features. Even without further adaptation on novel tasks, we can still surpass current state-of-the-art classification accuracy on few-shot classification tasks. The pipeline is shown in Figure 1. In the training phase, we feed images to our model and obtain the convolutional feature maps. Based on the feature maps, we extract three feature representations: the global pooling feature, the adversarial feature and the high level feature. The last two features are used to construct the multi-scale classification loss that updates the parameters of our model, while the first one is used to generate the adversarial attention mask. By incorporating the adversarial mask, we force the classifier to leave its comfort zone and focus on other regions. In this way, our model can explore and exploit the overall feature maps in depth to learn more diverse discriminative information than a traditional training scheme. In the test phase, only the global average pooling features are used for object representation.

The contributions of our work are as follows.

  • We propose a feature learning scheme based on adversarial features. Our model spatially explores and exploits the feature maps to mine discriminative information. The learned feature extractor can be directly used for few-shot classification without any adaptation on novel tasks.

  • We propose to use a multi-scale classifier to generate the adversarial attention. We show that compared to the classifier trained on single scale features, the attention region of multi-scale classifier is more accurately focused on the object.

  • We show that with the proposed method we can use the same trained model for both 5-way 1-shot tasks and 5-way 5-shot tasks and achieve new state-of-the-art results on both tasks. This demonstrates that our model indeed learns an efficient metric space that generalizes well to novel tasks.

2 Related work

2.1 Few-shot learning

In this section, we roughly categorize recent few-shot learning methods into two categories, meta-learning based approaches and weight generation based approaches.

Meta-learning based approaches. Models in this group are typically trained on episodes of tasks [21, 24, 28, 30]. Snell introduced prototypical networks that learned a metric space in which classification could be performed by computing distances to prototype representations of each class [27]. Sung proposed relation networks that learned a deep distance metric to compare images within episodes [28]. Boris proposed to learn a task-dependent metric space where the task representation was the mean of the class prototypes [16].

Weight generation based approaches. Models in this group learn to generate the classifier’s weights for novel classes [5, 19, 20, 25, 32, 35]. Gidaris and Komodakis devised a model that was able to efficiently learn novel categories from only a few training examples while not forgetting the initial classes on which it was trained [5]. Huang directly set weights for new classes based on an appropriately scaled copy of the embedding layer activations [19]. Siyuan also analyzed the relationship between the parameters and the activations and proposed to adapt a pre-trained network to novel categories by predicting the parameters from the activations [20]. Instead of directly predicting the classifier’s weights, Schwartz proposed to learn to extract transferable intra-class deformations between same-class pairs and to apply those deformations to examples from a novel class [25]. Ruixiang augmented vanilla few-shot classification models with the ability to discriminate between real and fake data so that the decision boundary became much sharper, leading to better generalization [35].

Figure 1: The pipeline of the proposed method in the training phase (left) and the test phase (right). In the training phase, we extract three feature vectors from the low level feature maps $F$: the global average pooling feature (1st row), the adversarial feature (2nd row) and the high level feature (3rd row). $W$ is the weight matrix of the classifier. The gradient of the entropy is back propagated to the averaging mask and is then used to obtain the adversarial region attention (orange line). The two cross entropy losses are used to train the model and update the parameters of the entire network. In the test phase, we only extract the global average pooling features from the low level feature maps as representations of images. Cosine similarity is used as the metric to measure the similarity between two images. (Best viewed in color.)

2.2 Adversarial learning

Adversarial learning has become increasingly popular in recent years. In this section, we briefly discuss the most relevant work: adversarial attack and adversarial complementary learning.

Adversarial attack. Adversarial examples are inputs formed by applying small but intentionally worst-case perturbations that cause the model to output an incorrect answer [6]. Adversarial example generation approaches can find the flaws of a machine learning model and then attack the model. There are many ways to generate adversarial perturbations [17, 14, 18, 6, 29, 8]. Goodfellow et al. designed the fast gradient sign method (FGSM), which generates the perturbation according to the sign of the gradient of the cost function with respect to the input [6]. Nicolas proposed the Jacobian-based Saliency Map Attack (JSMA), which constructs adversarial saliency maps enabling an efficient exploration of the adversarial-samples search space [18].

Adversarial complementary learning is introduced in [33, 36] for object localization. In adversarial complementary learning, two or more classifiers are trained to progressively mine discriminative object regions. Yunchao proposed to use three classification networks to sequentially discover complementary object regions by erasing the currently mined regions [33]. Xiaolin integrated the three networks in [33] into a single network and generated the localization map by forwarding the network only once [36].

3 Method

In this section, we will give a detailed description of our method. Firstly, we will show how to obtain adversarial features from low level feature maps. Then we will introduce multi-scale feature learning. Finally, we will provide some implementation details.

3.1 Adversarial features generation

3.1.1 Adversarial goal

Given feature maps $F$ with spatial size $N \times N$ and a classifier $C$, we try to find an attention mask $M^{adv}$ with which the linear combination of the feature vectors in $F$ makes the classifier uncertain about its prediction. We refer to $M^{adv}$ as the adversarial region attention. Formally,

$$M^{adv} = \arg\max_{M} \; U\Big(C\Big(\sum_{i,j} M_{ij} F_{ij}\Big)\Big) \qquad (1)$$

or

$$M^{adv} = \arg\max_{M} \; U\Big(C\Big(\sum_{i,j} M_{ij} F_{ij}\Big),\, y\Big) \qquad (2)$$

depending on whether the ground-truth label $y$ of the input image is used. $U$ is the metric that measures the uncertainty of the classifier’s prediction. In this work, we use entropy by default to measure the uncertainty of the prediction. The entropy can be written as



$$H = -\sum_{k=1}^{K} p_k \log p_k \qquad (3)$$

where $p_k$ is the prediction probability of the $k$-th class and $K$ is the number of classes.
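As a small illustration of why entropy is a natural uncertainty measure, the sketch below evaluates $H = -\sum_k p_k \log p_k$ (natural logarithm) for a peaked and a flat 5-class prediction. The two example distributions are made up for illustration.

```python
import math

def entropy(probs):
    """Shannon entropy H = -sum_k p_k * log(p_k); terms with p_k = 0 contribute 0."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

peaked = [0.96, 0.01, 0.01, 0.01, 0.01]  # confident prediction
flat = [0.2] * 5                         # maximally uncertain over 5 classes

h_peaked = entropy(peaked)
h_flat = entropy(flat)                   # equals log(5), the maximum for 5 classes
```

A flat distribution attains the maximum entropy $\log K$, while a one-hot prediction has zero entropy, so increasing the entropy pushes the classifier away from a confident prediction.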

3.1.2 Adversarial features

One can strictly adhere to Equation (1) or (2) to calculate the analytic solution of $M^{adv}$. However, we empirically observe that a single gradient step applied to an averaging mask works surprisingly well in all our scenarios. In order to obtain $M^{adv}$ using back propagation, we first describe the forward pass. To reduce the dimension of the convolutional feature maps, we apply global average pooling on the feature maps $F$ as in [7]. Global average pooling spatially averages layer activations and outputs a feature vector. It can also be explicitly implemented by applying an averaging mask $M$, with each element equal to $1/N^2$ for $N \times N$ feature maps, to the feature maps and summing up the activations as

$$f = \sum_{i,j} M_{ij} F_{ij} \qquad (4)$$

where $(i, j)$ indicates the spatial location on the feature maps. Once we have the global average pooling feature vector $f$, we feed it to the classifier $C$, which outputs a probability distribution over the training classes. If the classifier is very confident about its prediction, the distribution will have a single peak; otherwise the distribution will be flat. We try to find feature vectors from the feature maps that can flatten the probability distribution. Since we use the entropy $H$ to indicate uncertainty, we back propagate the gradient of $H$ with respect to the mask $M$. Then we can obtain the adversarial region attention mask $M^{adv}$ by updating $M$ as

$$M^{adv} = M + \epsilon \frac{\partial H}{\partial M} \qquad (5)$$

where $\epsilon$ is the step size. Given $M^{adv}$, we can directly calculate the adversarial feature $f^{adv}$ as

$$f^{adv} = \sum_{i,j} M^{adv}_{ij} F_{ij} \qquad (6)$$

$$f^{adv} = f + \epsilon \sum_{i,j} \frac{\partial H}{\partial M_{ij}} F_{ij} \qquad (7)$$

From Equation (7), we can see that the adversarial feature is a combination of an averaged feature representation and a small perturbation of adversarial noise. Therefore, models trained on those perturbed features have a lower risk of over-fitting the training data, and the learned metric space will also be smooth, which is helpful for knowledge transfer.
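The single gradient step described above can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: spatial positions are flattened into a list, the classifier is a plain dot-product linear layer followed by softmax (the cosine-similarity classifier and scaling factors described later are omitted), the gradient is derived analytically instead of by automatic differentiation, and all names (`pool`, `predict`, `entropy_grad_wrt_mask`, `adversarial_feature`) and the random toy data are hypothetical.

```python
import math
import random

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def entropy(p):
    return -sum(v * math.log(v) for v in p if v > 0.0)

def pool(F, M):
    """Mask-weighted sum of feature vectors: f = sum_j M[j] * F[j]."""
    return [sum(M[j] * F[j][d] for j in range(len(F))) for d in range(len(F[0]))]

def predict(f, W):
    """Dot-product linear classifier followed by softmax (an assumption)."""
    return softmax([sum(w * x for w, x in zip(row, f)) for row in W])

def entropy_grad_wrt_mask(F, M, W):
    """Analytic gradient of the prediction entropy H with respect to each mask entry."""
    p = predict(pool(F, M), W)
    H = entropy(p)
    dH_dz = [-pk * (math.log(pk) + H) for pk in p]   # entropy through softmax
    dH_df = [sum(dH_dz[k] * W[k][d] for k in range(len(W))) for d in range(len(F[0]))]
    return [sum(g * x for g, x in zip(dH_df, F_j)) for F_j in F]  # dH/dM[j] = F[j] . dH/df

def adversarial_feature(F, W, eps):
    n = len(F)
    M = [1.0 / n] * n                                # averaging mask
    grad = entropy_grad_wrt_mask(F, M, W)
    M_adv = [m + eps * g for m, g in zip(M, grad)]   # one gradient-ascent step on H
    return pool(F, M), pool(F, M_adv), M_adv

rng = random.Random(0)
F = [[rng.gauss(0, 1) for _ in range(16)] for _ in range(64)]   # 64 positions, 16 channels
W = [[rng.gauss(0, 0.1) for _ in range(16)] for _ in range(5)]  # 5 training classes
f, f_adv, M_adv = adversarial_feature(F, W, eps=1e-3)
```

Because the pooled feature is linear in the mask, `f_adv` equals `f` plus `eps` times the gradient-weighted sum of the feature vectors, which is the "averaged representation plus small adversarial perturbation" reading of the adversarial feature.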

3.1.3 An intuitive understanding

Suppose we have a classifier trained on the average pooling feature vectors $f$. The classifier can be sensitive to features in some conspicuous discriminative regions. If the feature activations in those regions dominate the average representation, the classifier will be very confident about its prediction; it will enter its comfort zone and stop learning other discriminative information. Mathematically speaking, in this case we are in the saturation area of the log softmax, where the gradient magnitude is very small. Fortunately, we are able to find adversarial regions on the feature maps that reduce our model’s confidence. We can then force our model to learn from those regions, so the discriminative information learned by our model is more diverse than that learned by conventional models. With diverse discriminative knowledge, our model can generalize well to novel tasks.

3.1.4 Design choice

Adversarial features from feature maps. In adversarial attack, researchers usually try to find subtle perturbations in image space that cause the classifier to make incorrect predictions, whereas our adversarial features are calculated from convolutional feature maps. The reason is that we focus on forcing our model to learn diverse semantic discriminative information, which is helpful for generalization. Subtle changes in image space do not provide any semantically different information. In contrast, our adversarial region attention can be applied to convolutional feature maps to dynamically generate semantically different adversarial features according to Equation 7.

Entropy loss. The reason to choose the entropy loss instead of the cross entropy loss in adversarial attention generation is that it offers much more abundant information in back propagation. For the cross entropy loss, when the classifier is well trained, the gradient magnitude with respect to the mask will be too small for efficient learning.
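A small numeric illustration of this design choice, under assumptions: the sketch compares gradient norms with respect to the logits (not the mask) for one made-up confident prediction whose top class is also the correct label. Both gradients reach the mask through the same linear maps, so the comparison is only indicative. The function names are hypothetical.

```python
import math

def entropy_grad_wrt_logits(p):
    """dH/dz_j = -p_j * (log p_j + H) for H = -sum_k p_k log p_k and p = softmax(z)."""
    H = -sum(v * math.log(v) for v in p)
    return [-v * (math.log(v) + H) for v in p]

def cross_entropy_grad_wrt_logits(p, label):
    """d(-log p_label)/dz_j = p_j - 1[j == label]."""
    return [v - (1.0 if k == label else 0.0) for k, v in enumerate(p)]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

# A well-trained classifier: 99% confidence on the correct class 0.
p = [0.99, 0.0025, 0.0025, 0.0025, 0.0025]
ent_norm = norm(entropy_grad_wrt_logits(p))
ce_norm = norm(cross_entropy_grad_wrt_logits(p, label=0))
```

In this example the entropy gradient is several times larger than the cross entropy gradient, because each entropy term carries an extra $\log p_j$ factor that grows as the small probabilities shrink.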

3.2 Multi-scale feature learning

3.2.1 High level feature learning

Convolutional networks extract high level structured information at high layers [34]. We train a fully convolutional neural network whose output feature maps have size $1 \times 1 \times D$, where $D$ is the number of channels. We denote this feature vector as $f^{h}$. It contains the overall image information, since the network gradually extracts more and more hierarchical information from the entire image. The classification loss for this level is the cross entropy loss

$$\mathcal{L}_{h} = -\log p_y(f^{h}) \qquad (8)$$

where $y$ is the label of the input sample and $p_y(\cdot)$ is the softmax probability of class $y$.

3.2.2 Low level feature learning

In few-shot learning, the classes in the test phase and those in the training phase are disjoint. Since high level features concentrate more on class-specific knowledge, the activations from the last layer cannot be directly used as the representations of novel categories. A common practice is to use the activations from intermediate layers to avoid dataset bias. The assumption is that even though the categories are disjoint between the training data and the test data, they may still share some common local patterns. The activations from the intermediate layers are sensitive to those local patterns and are thus potentially transferable to novel classes.

Given the feature maps $F$ from the intermediate layer, one can train a classifier based on the aggregated feature from $F$. In this work, we use the adversarial attention mask $M^{adv}$ for aggregation, which gives the adversarial feature $f^{adv}$. The training loss is also the cross entropy loss

$$\mathcal{L}_{adv} = -\log p_y(f^{adv}) \qquad (9)$$

Note that the adversarial mask is only used in the training phase for mining adversarial features. In the test phase, we apply global average pooling to obtain the feature vector.

3.2.3 Multi-scale learning

During training, the loss function is the combination of the high level classification loss and the low level adversarial classification loss

$$\mathcal{L} = \mathcal{L}_{h} + \mathcal{L}_{adv} \qquad (10)$$

The entropy loss is not used for parameter update. It is only used to calculate the adversarial mask $M^{adv}$.

Despite the generalization capability of low level features, the relatively small receptive field of low level activations may cause the model to incidentally capture non-object regions as the representative features of the class. This happens especially when global pooling is applied to the last convolutional feature maps. The reason is that global pooling removes spatial information from the feature maps and only local activations are aggregated for classification. In contrast, the receptive field of high level activations covers the entire image. Therefore, it is reasonable to assume that the features learned at the high level are more accurate. To improve discriminative feature learning in low level layers, we share the classifier between high level feature learning and low level feature learning. This simple multi-scale learning strategy aligns features from different scales by enabling knowledge to flow between the high level representations and the low level representations. The benefit of the multi-scale learning strategy is shown in Figure 2.
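The shared-classifier idea can be sketched as follows. This is a hedged illustration, not the authors' code: it assumes the two cross entropy losses are simply added with equal weight, uses dot-product logits instead of the scaled cosine similarity described in the design choices, and the names `cross_entropy`, `multi_scale_loss` and `W_shared` as well as the toy feature values are hypothetical.

```python
import math

def log_softmax(z):
    m = max(z)
    lse = m + math.log(sum(math.exp(v - m) for v in z))
    return [v - lse for v in z]

def cross_entropy(f, W, label):
    """Cross entropy of a linear classifier W applied to feature vector f."""
    logits = [sum(w * x for w, x in zip(row, f)) for row in W]
    return -log_softmax(logits)[label]

def multi_scale_loss(f_high, f_adv, W_shared, label):
    """Both scales are classified by the SAME weight matrix, then the losses are summed."""
    loss_high = cross_entropy(f_high, W_shared, label)
    loss_adv = cross_entropy(f_adv, W_shared, label)
    return loss_high + loss_adv, loss_high, loss_adv

# Toy 2-class example: the high level feature is confidently class 0,
# the adversarial low level feature is less so.
W_shared = [[1.0, 0.0], [0.0, 1.0]]
total, loss_high, loss_adv = multi_scale_loss([2.0, 0.5], [1.0, 0.8], W_shared, label=0)
```

Because the same `W_shared` scores both features, gradients from the high level loss and the adversarial low level loss update one set of class weights, which is what ties the two scales together.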

3.2.4 Design choice

Cosine similarity. For both the entropy loss and the cross entropy loss, we use the cosine similarity between the feature representations and the classifier’s weight vectors, instead of the dot product, as the metric before the softmax activation. Cosine similarity based models have been shown to generalize significantly better on novel categories [5]. However, the range of cosine similarity is fixed to [-1, 1], which makes efficient learning difficult. A common practice is to multiply the similarity value by a scaling factor, which can be a fixed value [3] or a learnable parameter [31], to control the peakiness of the probability distribution. In this work we fix the scaling factors for simplicity of implementation.
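The effect of the scaling factor on peakiness can be seen in a minimal sketch. The scaling values 20 and 5 are the ones reported in the implementation details of this paper; the toy feature, the toy class weights and the function names are illustrative assumptions.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def scaled_cosine_softmax(f, W, scale):
    """Softmax over scale * cos(f, w_k) for each class weight vector w_k."""
    logits = [scale * cosine(f, w) for w in W]
    m = max(logits)
    e = [math.exp(z - m) for z in logits]
    s = sum(e)
    return [x / s for x in e]

f = [1.0, 0.2, 0.0]                                      # toy feature vector
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # toy class weights

p_unscaled = scaled_cosine_softmax(f, W, scale=1.0)   # logits confined to [-1, 1]
p_adv = scaled_cosine_softmax(f, W, scale=5.0)        # scale used for the adversarial attention
p_train = scaled_cosine_softmax(f, W, scale=20.0)     # scale used for the cross entropy loss
```

Without scaling, the logits are confined to [-1, 1] and the softmax output stays close to uniform even for a well-aligned feature; larger scaling factors make the same cosine similarities yield a more peaked distribution.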

Step size. The step size $\epsilon$ in Equations 5 and 7 is just like the learning rate when training deep models. It cannot be too large or too small. Since it multiplies the gradient from the entropy loss, we assume there exists a reciprocal relationship between $\epsilon$ and the scaling factor $s$ for stable training. In other words, if the scaling factor is large, we should choose a small $\epsilon$; otherwise we choose a large value. Therefore, we have

$$\epsilon \propto \frac{1}{s} \qquad (11)$$

Note that we do not need to strictly follow this reciprocal relationship. A slightly different value does not affect the performance of our model much (see Section 4.3.3 for details).

The pseudo code of the proposed method is provided in Algorithm 1.

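Input: Low level feature maps $F$, a classifier $C$ and an averaging mask $M$.
  1: $f$ = GlobalAveragePool($F$, $M$)
  2: Compute the entropy $H$ of the prediction $C(f)$
  3: Back propagate the partial derivative $\partial H / \partial M$
  4: Update the mask $M^{adv} = M + \epsilon \, \partial H / \partial M$ and compute $f^{adv}$ = GlobalAveragePool($F$, $M^{adv}$)
  5: Forward $F$ to another conv-pool block to obtain the high level feature
  6: Update network parameters to minimize the combined classification loss $\mathcal{L}$.
Algorithm 1 Learning from adversarial features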

3.3 Implementation details

For both the miniImageNet dataset and the tieredImageNet dataset we use the same network and the same hyper-parameter settings. The learning rate is set to 0.001 as in [27] and is halved every 10 epochs. We train our model for 50 epochs and choose the model that achieves the best 5-way 1-shot classification accuracy on the validation set for testing. Note that our model is trained just like common classification models, whose input is a batch of images rather than a batch of tasks as used in meta-learning. We train our model once and test it on both 5-way 1-shot and 5-way 5-shot tasks. For 1-shot tasks, we compare the cosine similarity of the query image features to each support image feature and assign the label of the nearest neighbor to the query image. For 5-shot tasks, we average the feature vectors of the five images that belong to the same class as the prototype representation of that class. Then we assign the label of the nearest prototype to the query image. The CNN backbone we use is a VGG-like [26] network. Details are shown in Table 1. We use the activations from the conv5 layer as the low level feature maps $F$. The input image size is chosen so that the spatial size of the feature maps from the conv5 layer is sufficient for adversarial feature mining. The scaling factor for the training cross entropy loss is fixed to 20 and that for the adversarial region attention is fixed to 5. In the training phase, we only employ random flip as data augmentation. In the test phase, we randomly sample 1000 tasks and report the average accuracy.
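The test-phase classification rule described above can be sketched as follows, assuming feature vectors have already been extracted by global average pooling. The function names and the toy 2-way 2-shot features are hypothetical; with one support image per class the prototype is the support feature itself, so the same code covers the 1-shot nearest neighbor rule.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def prototypes(support_feats, support_labels, n_way):
    """Average the support features of each class to get one prototype per class."""
    protos = []
    for c in range(n_way):
        members = [f for f, y in zip(support_feats, support_labels) if y == c]
        protos.append([sum(col) / len(members) for col in zip(*members)])
    return protos

def classify(query_feat, protos):
    """Assign the label of the prototype with the highest cosine similarity."""
    sims = [cosine(query_feat, p) for p in protos]
    return max(range(len(protos)), key=sims.__getitem__)

# Toy 2-way 2-shot task with 2-dimensional features (made-up values).
support_feats = [[1.0, 0.1], [0.9, 0.0], [0.0, 1.0], [0.2, 0.8]]
support_labels = [0, 0, 1, 1]
protos = prototypes(support_feats, support_labels, n_way=2)
predictions = [classify(q, protos) for q in ([0.8, 0.3], [0.1, 0.7])]
```

No parameters are updated at test time in this sketch: the trained feature extractor is used as-is and only similarities to the support features are computed.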

4 Experiments

In this section, we evaluate our method on both miniImageNet dataset and tieredImageNet dataset. We will compare our results with current state-of-the-art results on both 5-way 1-shot tasks and 5-way 5-shot tasks. We also perform ablation studies to show how much improvement is brought by adversarial feature learning and how much is brought by multi-scale feature learning. We also evaluate the vulnerability of different baseline models.

4.1 Datasets

miniImageNet [30] is a subset of ILSVRC-12 [23]. It contains 100 classes with 600 images per class. We follow the split in [21], with 64, 16 and 20 classes for training, validation and test, respectively.

tieredImageNet [22] is a much larger subset of ILSVRC-12 [23]. It contains 608 classes belonging to 34 categories grouped according to ImageNet [2]. These categories are split into 20 training (351 classes), 6 validation (97 classes) and 8 test (160 classes) categories. Unlike the miniImageNet dataset, all the training classes in the tieredImageNet dataset are sufficiently distinct from the test classes.

Layer    Network details
conv1    2× Conv(128,3,3)-BN-leakyReLU
-        Maxpool(2,2)
conv2    2× Conv(128,3,3)-BN-leakyReLU
-        Maxpool(2,2)
conv3    2× Conv(256,3,3)-BN-leakyReLU
-        Maxpool(2,2)
conv4    2× Conv(512,3,3)-BN-leakyReLU
-        Maxpool(2,2)
conv5    2× Conv(512,3,3)-BN-leakyReLU
-        Maxpool(2,2)
conv6    2× Conv(512,3,3)-BN-leakyReLU
-        Maxpool(2,2)
conv7    Conv(512,2,2)-BN-leakyReLU
Table 1: Network details. “Conv($c$,3,3)” indicates there are $c$ convolution kernels with size $3 \times 3$. “2×” means there are two consecutive blocks. “Maxpool(2,2)” means max-pooling with stride 2 and a pooling window of size $2 \times 2$. “BN” indicates batch normalization [9]. All leakyReLUs share the same ratio of 0.2 in the negative region. “FC” is the fully connected layer.

4.2 Comparison with state-of-the-art

In this section, we compare our results with the state-of-the-art results. The results are shown in Tables 2 and 3. For 1-shot tasks, our model improves the classification accuracy over the best previous result by 1.52 and 3.21 percentage points on miniImageNet and tieredImageNet respectively. For 5-shot tasks, the improvements are 1.11 and 2.91 percentage points on miniImageNet and tieredImageNet respectively. Note that our model is quite simple and we perform no further adaptation to novel tasks. We use the same model for both 1-shot tasks and 5-shot tasks, while most state-of-the-art models are trained separately for 1-shot tasks and 5-shot tasks. The results show that the generalization capability of our model is superior to that of previous top models.

Model                         1-shot            5-shot
SNAIL [13]                    55.71 ± 0.99%     68.88 ± 0.92%
Gidaris and Komodakis [5]     56.20 ± 0.86%     73.00 ± 0.64%
TADAM [16]                    58.5 ± 0.3%       76.7 ± 0.3%
TPN [12]                      55.51 ± 0.86%     69.86 ± 0.65%
adaResNet [15]                56.88 ± 0.62%     71.94 ± 0.57%
Self-Jig [1]                  58.80 ± 1.36%     76.71 ± 0.72%
CAML [10]                     59.23 ± 0.99%     72.35 ± 0.71%
Qiao [20]                     59.60 ± 0.41%     73.74 ± 0.19%
LEO [24]                      61.76 ± 0.08%     77.59 ± 0.12%
ours                          63.28 ± 0.46%     78.70 ± 0.42%
Table 2: Comparison with the state-of-the-art results with 95% confidence intervals on miniImageNet 5-way classification.

Model                     1-shot            5-shot
MAML [12]                 51.67 ± 1.81%     70.30 ± 1.75%
Prototypical Nets [22]    53.31 ± 0.89%     72.69 ± 0.74%
Relation Net [12]         54.48 ± 0.93%     71.31 ± 0.78%
TPN [12]                  59.91 ± 0.94%     73.30 ± 0.75%
LEO [24]                  66.33 ± 0.05%     81.44 ± 0.09%
ours                      69.54 ± 0.52%     84.35 ± 0.32%
Table 3: Comparison with the state-of-the-art results with 95% confidence intervals on tieredImageNet 5-way classification.

Figure 2: Adversarial region attention visualization. Red regions are positive regions that contribute to the classifier’s correct prediction, while the blue regions are adversarial regions that reduce the classifier’s confidence. The 1st row presents the input images. The 2nd row shows the adversarial region attention from C5-cls (see Section 4.3 for details) trained with low level features, and the 3rd row shows the adversarial region attention from C5-C7-cls trained with multi-scale features. It can be seen from the 2nd row that model C5-cls incorrectly treats some background regions as the representative object. In the 3rd row, the objects are more accurately attended to. (Best viewed in color.)

4.3 Ablation study

In order to show how much each component of our model contributes to the final performance, we conduct ablation studies in this section. We first introduce the baseline models that will be used for comparison. Although the training loss or the network architecture of the baseline models is slightly different, the layers from conv1 to conv5, which are used in the test phase, are the same as listed in Table 1 for all models.

  • C5-cls. The backbone CNN for this baseline model contains conv1 to conv5 layers. Feature maps from conv5 layer are global average pooled as feature vectors fed to the final linear classifier. The training loss is the cross entropy loss and the model is trained without adversarial features.

  • C5-adv. This model has the same backbone network as C5-cls. The difference is that this model is trained with the adversarial features.

  • C5-C7-cls. This model has the same architecture as our model, containing conv1 to conv7 layers. However, the model is trained using only multi-scale global average pooling features for classification.

              miniImageNet          tieredImageNet
Model         1-shot     5-shot     1-shot     5-shot
C5-cls        58.24%     74.05%     66.21%     81.33%
C5-adv        61.56%     76.97%     68.90%     82.24%
C5-C7-cls     58.02%     74.32%     67.75%     82.57%
ours          63.28%     78.70%     69.54%     84.35%
Table 4: Comparison with different baseline models on miniImageNet and tieredImageNet 5-way classification.

The comparison results are shown in Table 4. By comparing the classification accuracy of our model to that of C5-cls, we can find large improvements on both 1-shot and 5-shot tasks demonstrating the effectiveness of adversarial feature learning and multi-scale feature learning. In the following subsections, we will evaluate the effectiveness of each component.

4.3.1 Evaluation of adversarial attention

By comparing C5-adv to C5-cls and our model to C5-C7-cls, we observe consistent improvements on both 1-shot and 5-shot tasks. The reason is that when trained with adversarial features, our model is able to explore the entire spatial area of the feature maps. Compared to focusing on conspicuous discriminative regions, spatial exploration provides chances to find more discriminative regions. The performance gap also provides evidence that (a) conventionally trained models tend to over-fit the training data and thus the generalization capability on novel classes is limited and (b) adversarial region attention can help reduce such over-fitting and increase the generalization capability without any change of the feature extraction part.

4.3.2 Evaluation of multi-scale feature learning

For C5-cls and C5-C7-cls, there is a noticeable difference in their adversarial region attention. We illustrate the results in Figure 2. From the figure, we can find that the classifier of C5-cls sometimes cannot capture the correct objects. It may mis-classify the background as the representative object. This is caused by the fact that the receptive field of the activations in conv5 is not large enough to capture the overall semantic discriminative features. When we use multi-scale feature learning, we observe that the attended regions are obviously more accurate. More accurate localization also helps adversarial feature learning which is justified by the fact that the classification accuracy of our model is higher than C5-adv on both datasets.

Figure 3: 5-way 1/5-shot classification accuracy on miniImageNet and tieredImageNet with respect to different step sizes $\epsilon$.

4.3.3 Evaluation of varying step size

In Equation (7), if the step size $\epsilon$ is too large, the averaged feature representation will be dominated by the perturbation, and the large variance of the feature vectors will make it too difficult for the network to converge. However, if $\epsilon$ is too small, our model degrades to global average pooling. Therefore, we evaluate $\epsilon$ in a reasonable range from 0.1 to 0.8 in Figure 3. Within this range the performance of our model is quite stable. The reason could be that the classifier is trained on multi-scale features in which the high level features are not perturbed. They can thus serve as a reference for the alignment of the adversarial features and contribute to the stable performance.

Figure 4: Classification accuracy on the training data of miniImageNet with respect to different levels of adversarial perturbation. The accuracies of models trained without adversarial features drop drastically even when only a small amount of adversarial perturbation is added, while models trained with adversarial features are much more robust to perturbations. (Best viewed in color.)

4.3.4 Evaluation of model vulnerability

In this section, we evaluate the vulnerability of the baseline models when fed with adversarially perturbed features. Once we have trained all models, we gradually increase the adversarial perturbation by constructing adversarial features according to Equation (7) with increasing step size $\epsilon$. The results are shown in Figure 4. The performance of C5-adv and our model is quite stable, while the performance drops sharply for C5-cls and C5-C7-cls. This indicates that the conventionally trained models over-fit the training data such that a slight adversarial perturbation can cause incorrect predictions.

4.3.5 Effect of dataset size

In Table 4, the improvements of our model over C5-cls on miniImageNet are 5.04 and 4.65 percentage points on 1-shot and 5-shot tasks, while the improvements on tieredImageNet are 3.33 and 3.02 percentage points respectively. The difference in the improvements could be explained by the different sizes of the two datasets. tieredImageNet contains 351 classes for training while miniImageNet contains only 64 classes. With less diverse data, the over-fitting problem can be more serious and thus impairs the model’s ability to generalize. Therefore, the improvements on miniImageNet are larger than those on tieredImageNet.
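The improvements quoted above follow directly from Table 4. The short check below recomputes them as differences between the "ours" and C5-cls rows; the dictionary layout and key names are just a convenient illustration.

```python
# Accuracies (%) copied from Table 4.
table4 = {
    "C5-cls": {"mini_1shot": 58.24, "mini_5shot": 74.05,
               "tiered_1shot": 66.21, "tiered_5shot": 81.33},
    "ours":   {"mini_1shot": 63.28, "mini_5shot": 78.70,
               "tiered_1shot": 69.54, "tiered_5shot": 84.35},
}

# Improvement of our model over the C5-cls baseline, in percentage points.
gains = {
    task: round(table4["ours"][task] - table4["C5-cls"][task], 2)
    for task in table4["ours"]
}
```

The resulting gains are larger on miniImageNet than on tieredImageNet for both the 1-shot and the 5-shot setting, which is the pattern the paragraph above attributes to dataset size.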

5 Conclusion

In this paper, we propose to learn generalizable features by learning from adversarial features. This approach is particularly useful for tasks like few-shot classification, where the test classes are different from the training classes. Our model is quite simple, and the trained feature extractor can be easily combined with other methods. In future work, we will try further adaptations like [5] to see how much the performance can be improved. We will also try to automatically learn the scaling factor in the cross-entropy loss, which is a fixed value in the current work.


  • [1] Z. Chen, Y. Fu, K. Chen, and Y.-G. Jiang. Image block augmentation for one-shot learning. In Thirty-Third AAAI Conference on Artificial Intelligence, 2019.
  • [2] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. 2009.
  • [3] J. Deng, J. Guo, N. Xue, and S. Zafeiriou. Arcface: Additive angular margin loss for deep face recognition. In CVPR, 2019.
  • [4] C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1126–1135. JMLR. org, 2017.
  • [5] S. Gidaris and N. Komodakis. Dynamic few-shot visual learning without forgetting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4367–4375, 2018.
  • [6] I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
  • [7] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [8] W. He, B. Li, and D. Song. Decision boundary analysis of adversarial examples. In International Conference on Learning Representations, 2018.
  • [9] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning - Volume 37, ICML’15, pages 448–456, 2015.
  • [10] X. Jiang, M. Havaei, F. Varno, G. Chartrand, N. Chapados, and S. Matwin. Learning to learn with conditional class dependencies. In International Conference on Learning Representations, 2019.
  • [11] H. Lakkaraju, E. Kamar, R. Caruana, and E. Horvitz. Identifying unknown unknowns in the open world: Representations and policies for guided exploration. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
  • [12] Y. Liu, J. Lee, M. Park, S. Kim, E. Yang, S. Hwang, and Y. Yang. Learning to propagate labels: Transductive propagation network for few-shot learning. In International Conference on Learning Representations, 2019.
  • [13] N. Mishra, M. Rohaninejad, X. Chen, and P. Abbeel. A simple neural attentive meta-learner. arXiv preprint arXiv:1707.03141, 2017.
  • [14] S.-M. Moosavi-Dezfooli, A. Fawzi, and P. Frossard. Deepfool: a simple and accurate method to fool deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2574–2582, 2016.
  • [15] T. Munkhdalai, X. Yuan, S. Mehri, and A. Trischler. Rapid adaptation with conditionally shifted neurons. In International Conference on Machine Learning, pages 3661–3670, 2018.
  • [16] B. Oreshkin, P. R. López, and A. Lacoste. Tadam: Task dependent adaptive metric for improved few-shot learning. In Advances in Neural Information Processing Systems, pages 719–729, 2018.
  • [17] N. Papernot, P. McDaniel, I. Goodfellow, S. Jha, Z. B. Celik, and A. Swami. Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security, pages 506–519. ACM, 2017.
  • [18] N. Papernot, P. McDaniel, S. Jha, M. Fredrikson, Z. B. Celik, and A. Swami. The limitations of deep learning in adversarial settings. In 2016 IEEE European Symposium on Security and Privacy (EuroS&P), pages 372–387. IEEE, 2016.
  • [19] H. Qi, M. Brown, and D. G. Lowe. Low-shot learning with imprinted weights. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5822–5830, 2018.
  • [20] S. Qiao, C. Liu, W. Shen, and A. L. Yuille. Few-shot image recognition by predicting parameters from activations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7229–7238, 2018.
  • [21] S. Ravi and H. Larochelle. Optimization as a model for few-shot learning. In International Conference on Learning Representations, 2017.
  • [22] M. Ren, E. Triantafillou, S. Ravi, J. Snell, K. Swersky, J. B. Tenenbaum, H. Larochelle, and R. S. Zemel. Meta-learning for semi-supervised few-shot classification. arXiv preprint arXiv:1803.00676, 2018.
  • [23] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. International journal of computer vision, 115(3):211–252, 2015.
  • [24] A. A. Rusu, D. Rao, J. Sygnowski, O. Vinyals, R. Pascanu, S. Osindero, and R. Hadsell. Meta-learning with latent embedding optimization. In International Conference on Learning Representations, 2019.
  • [25] E. Schwartz, L. Karlinsky, J. Shtok, S. Harary, M. Marder, A. Kumar, R. Feris, R. Giryes, and A. Bronstein. Delta-encoder: an effective sample synthesis method for few-shot object recognition. In Advances in Neural Information Processing Systems, pages 2850–2860, 2018.
  • [26] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, 2015.
  • [27] J. Snell, K. Swersky, and R. Zemel. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pages 4077–4087, 2017.
  • [28] F. Sung, Y. Yang, L. Zhang, T. Xiang, P. H. Torr, and T. M. Hospedales. Learning to compare: Relation network for few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1199–1208, 2018.
  • [29] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
  • [30] O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra, et al. Matching networks for one shot learning. In Advances in neural information processing systems, pages 3630–3638, 2016.
  • [31] F. Wang, X. Xiang, J. Cheng, and A. L. Yuille. Normface: L2 hypersphere embedding for face verification. In Proceedings of the 25th ACM international conference on Multimedia, pages 1041–1049. ACM, 2017.
  • [32] Y.-X. Wang, R. Girshick, M. Hebert, and B. Hariharan. Low-shot learning from imaginary data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7278–7286, 2018.
  • [33] Y. Wei, J. Feng, X. Liang, M.-M. Cheng, Y. Zhao, and S. Yan. Object region mining with adversarial erasing: A simple classification to semantic segmentation approach. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1568–1576, 2017.
  • [34] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In European conference on computer vision, pages 818–833. Springer, 2014.
  • [35] R. Zhang, T. Che, Z. Ghahramani, Y. Bengio, and Y. Song. Metagan: An adversarial approach to few-shot learning. In Advances in Neural Information Processing Systems, pages 2371–2380, 2018.
  • [36] X. Zhang, Y. Wei, J. Feng, Y. Yang, and T. S. Huang. Adversarial complementary learning for weakly supervised object localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1325–1334, 2018.