Free Lunch for Few-shot Learning: Distribution Calibration

01/16/2021 · by Shuo Yang, et al. · University of Technology Sydney

Learning from a limited number of samples is challenging since the learned model can easily become overfitted to the biased distribution formed by only a few training examples. In this paper, we calibrate the distribution of these few-sample classes by transferring statistics from classes with sufficient examples; then an adequate number of examples can be sampled from the calibrated distribution to expand the inputs to the classifier. We assume every dimension in the feature representation follows a Gaussian distribution, so that the mean and the variance of the distribution can borrow from those of similar classes whose statistics are better estimated with an adequate number of samples. Our method can be built on top of off-the-shelf pretrained feature extractors and classification models without extra parameters. We show that a simple logistic regression classifier trained using the features sampled from our calibrated distribution can outperform the state-of-the-art accuracy on two datasets (~5% improvement on miniImageNet compared to the next best). The visualization of these generated features demonstrates that our calibrated distribution is an accurate estimation.

1 Introduction

Class (vs. Arctic fox)  mean sim  var sim
white wolf              97%       97%
malamute                85%       78%
lion                    81%       70%
meerkat                 78%       70%
jellyfish               46%       26%
orange                  40%       19%
beer bottle             34%       11%
Table 1: The class mean similarity (“mean sim”) and class variance similarity (“var sim”) between Arctic fox and different classes.

Learning from a limited number of training samples has drawn increasing attention due to the high cost of collecting and annotating large amounts of data. Researchers have developed algorithms to improve the performance of models trained with very few samples. Finn et al. (2017); Snell et al. (2017) train models in a meta-learning fashion so that the model can adapt quickly to tasks with only a few training samples available. Hariharan and Girshick (2017); Wang et al. (2018) try to synthesize data or features by learning a generative model to alleviate the data-insufficiency problem. Ren et al. (2018) propose to leverage unlabeled data and predict pseudo labels to improve the performance of few-shot learning.

While most previous works focus on developing stronger models, scant attention has been paid to the properties of the data itself. It is natural that as the amount of data grows, the ground-truth distribution can be uncovered more accurately. Models trained with a wide coverage of data can generalize well during evaluation. On the other hand, when a model is trained with only a few samples, it tends to overfit to them by minimizing the training loss over these samples. These phenomena are illustrated in Figure 1. Such a biased distribution based on a few examples can damage the generalization ability of the model, since it is far from mirroring the ground-truth distribution from which test cases are sampled during evaluation.

Here, we consider calibrating this biased distribution into a more accurate approximation of the ground-truth distribution. In this way, a model trained with inputs sampled from the calibrated distribution can generalize over a broader range of data from a more accurate distribution rather than only fitting itself to those few samples. Instead of calibrating the distribution of the original data space, we try to calibrate the distribution in the feature space, which has much lower dimensions and is easier to calibrate (Xian et al., 2018). We assume every dimension in the feature vectors follows a Gaussian distribution and observe that similar classes usually have similar mean and variance of the feature representations, as shown in Table 1. Thus, the mean and variance of the Gaussian distribution can be transferred across similar classes (Salakhutdinov et al., 2012). Meanwhile, the statistics can be estimated more accurately when there are adequate samples for a class. Based on these observations, we reuse the statistics from many-shot classes and transfer them to better estimate the distribution of the few-shot classes according to their class similarity. More samples can then be generated from the estimated distribution, which provides sufficient supervision for training the classification model.

In the experiments, we show that a simple logistic regression classifier trained with our strategy can achieve state-of-the-art accuracy on two datasets. Our distribution calibration strategy can be paired with any classifier and feature extractor with no extra learnable parameters. Training with samples selected from the calibrated distribution can achieve a 12% accuracy gain compared to the baseline, which is trained only with the few samples given in a 5way1shot task. We also visualize the calibrated distribution and show that it is an accurate approximation of the ground truth that can better cover the test cases.

Figure 1: Training a classifier from few-shot features makes the classifier overfit to the few examples (Left). Classifier trained with features sampled from calibrated distribution has better generalization ability (Right).

2 Related Works

Few-shot classification is a challenging machine learning problem, and researchers have explored the idea of learning to learn, or meta-learning, to improve quick adaptation and alleviate the few-shot challenge. One of the most general families of meta-learning approaches is the optimization-based algorithm.

Finn et al. (2017) and Li et al. (2017) proposed to learn how to optimize the gradient descent procedure so that the learner can have a good initialization, update direction, and learning rate. For the classification problem, researchers proposed simple but effective algorithms based on metric learning. MatchingNet (Vinyals et al., 2016) and ProtoNet (Snell et al., 2017) learned to classify samples by comparing the distance to the representatives of each class. Our distribution calibration and feature sampling procedure does not include any learnable parameters, and the classifier is trained in a traditional supervised learning way.

Another line of algorithms compensates for the insufficient number of available samples by generation. Most methods use the idea of Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) or autoencoders (Rumelhart et al., 1986) to generate samples (Zhang et al., 2018; Chen et al., 2019b; Schwartz et al., 2018; Gao et al., 2018) or features (Xian et al., 2018; Zhang et al., 2019) to augment the training set. Specifically, Zhang et al. (2018) and Xian et al. (2018) proposed to synthesize data by introducing an adversarial generator conditioned on tasks. Zhang et al. (2019) tried to learn a variational autoencoder to approximate the distribution and predict labels based on the estimated statistics. The autoencoder can also augment samples by projecting between the visual space and the semantic space (Chen et al., 2019b) or encoding the intra-class deformations (Schwartz et al., 2018). Liu et al. (2019b) and Liu et al. (2019a) proposed to generate features through the class hierarchy. While these methods can generate extra samples or features for training, they require the design of a complex model and loss function to learn how to generate. In contrast, our distribution calibration strategy is simple and does not need extra learnable parameters.

Data augmentation is a traditional and effective way of increasing the number of training samples. Qin et al. (2020) and Antoniou and Storkey (2019) proposed the use of traditional data augmentation techniques to construct pretext tasks for unsupervised few-shot learning. Wang et al. (2018) and Hariharan and Girshick (2017) leveraged the general idea of data augmentation: they designed a hallucination model to generate an augmented version of an image with different choices for the model’s input, i.e., an image and noise (Wang et al., 2018) or the concatenation of multiple features (Hariharan and Girshick, 2017). Park et al. (2020); Wang et al. (2019); Liu et al. (2020) tried to augment feature representations by leveraging intra-class variance. These methods learn to augment from the original samples or their feature representations, whereas we estimate the class-level distribution, which eliminates the inductive bias of a single sample and provides more diverse generations from the calibrated distribution.

3 Main Approach

In this section, we introduce the few-shot classification problem definition in Section 3.1 and details of our proposed approach in Section 3.2.

3.1 Problem Definition

We follow a typical few-shot classification setting. Given a dataset with data-label pairs $\mathcal{D} = \{(x_i, y_i)\}$, where $x_i \in \mathbb{R}^d$ is the feature vector of a sample and $y_i \in C$, where $C$ denotes the set of classes. This set of classes is divided into base classes $C_b$ and novel classes $C_n$, where $C_b \cap C_n = \emptyset$ and $C_b \cup C_n = C$. The goal is to train a model on the data from the base classes so that the model can generalize well on tasks sampled from the novel classes. In order to evaluate the fast adaptation ability or the generalization ability of the model, there are only a few available labeled samples for each task $T$. The most common way to build a task is called an N-way-K-shot task (Vinyals et al., 2016), where N classes are sampled from the novel set and only K (e.g., 1 or 5) labeled samples are provided for each class. The few available labeled data are called the support set $\mathcal{S}$, and the model is evaluated on another query set $\mathcal{Q}$, where every class in the task has $q$ test cases. Thus, the performance of a model is evaluated as the averaged accuracy on (the query set of) multiple tasks sampled from the novel classes.
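To make this protocol concrete, below is a minimal sketch (our illustration, not part of the paper) of sampling one N-way-K-shot task; `features_by_class` is a hypothetical dictionary mapping each novel class to the extracted feature vectors of its samples.

```python
import random

def sample_task(features_by_class, n_way=5, k_shot=1, q_queries=15, rng=None):
    """Sample one N-way-K-shot task: a support set and a query set.

    features_by_class: hypothetical dict mapping a novel class id to a list
    of its feature vectors (not from the paper).
    """
    rng = rng or random.Random()
    classes = rng.sample(sorted(features_by_class), n_way)
    support, query = [], []
    for label, cls in enumerate(classes):
        feats = features_by_class[cls]
        idx = rng.sample(range(len(feats)), k_shot + q_queries)
        support += [(feats[i], label) for i in idx[:k_shot]]   # K labeled samples per class
        query += [(feats[i], label) for i in idx[k_shot:]]     # q test cases per class
    return support, query
```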

3.2 Distribution Calibration

As introduced in Section 3.1, the base classes have a sufficient amount of data while the evaluation tasks sampled from the novel classes only have a limited number of labeled samples. The statistics of the distribution for a base class can be estimated far more accurately than from few-shot samples, for which the estimation is an ill-posed problem. As shown in Table 1, we observe that if we assume the feature distribution is Gaussian, the mean and variance with respect to each class are correlated with the semantic similarity between classes. With this in mind, the statistics can be transferred from the base classes to the novel classes if we learn how similar two classes are. In the following sections, we discuss how we calibrate the distribution estimation of the classes with only a few samples (Section 3.2.2) with the help of the statistics of the base classes (Section 3.2.1). We also elaborate on how we leverage the calibrated distribution to improve the performance of few-shot learning (Section 3.2.3).

Note that our distribution calibration strategy operates at the feature level and is agnostic to the feature extractor. Thus, it can be built on top of any pretrained feature extractor without further costly fine-tuning. In our experiments, we use the pretrained WideResNet following previous work (Mangla et al., 2020). The WideResNet is trained to classify the base classes, along with a self-supervised pretext task, to learn general-purpose representations suitable for image understanding tasks. Please refer to their paper for more details on training the feature extractor.

Require:  Support set features $\mathcal{S} = \{(x_i, y_i)\}_{i=1}^{N \times K}$
Require:  Base classes' statistics $\{\mu_i \mid i \in C_b\}$, $\{\Sigma_i \mid i \in C_b\}$
1:  Transform the support set features with Tukey's Ladder of Powers as in Equation 3
2:  for each transformed support feature $(\tilde{x}_i, y_i)$ do
3:      Calibrate the mean $\mu'$ and the covariance $\Sigma'$ for class $y_i$ using $\tilde{x}_i$ with Equation 6
4:      Sample features for class $y_i$ from the calibrated distribution as in Equation 7
5:  end for
6:  Train a classifier using both support set features and all sampled features as in Equation 8
Algorithm 1 Training procedure for an N-way-K-shot task

3.2.1 Statistics of the base classes

We assume the feature distribution of base classes is Gaussian. The mean of the feature vector from a base class $i$ is calculated as the mean of every single dimension in the vector:

$$\mu_i = \frac{\sum_{j=1}^{n_i} x_j}{n_i}, \qquad (1)$$

where $x_j$ is the feature vector of the $j$-th sample from the base class $i$ and $n_i$ is the total number of samples in class $i$. As the feature vector is multi-dimensional, we use the covariance for a better representation of the variance between any pair of elements in the feature vector. The covariance matrix $\Sigma_i$ for the features from class $i$ is calculated as:

$$\Sigma_i = \frac{1}{n_i - 1} \sum_{j=1}^{n_i} (x_j - \mu_i)(x_j - \mu_i)^{\top}. \qquad (2)$$
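As a concrete reference for Equations 1 and 2, the base class statistics can be computed with a few lines of NumPy; this is a minimal sketch under the assumption that `base_features_by_class` maps each base class to an (n_i, d) matrix of extracted features.

```python
import numpy as np

def base_class_statistics(base_features_by_class):
    """Compute the per-class mean (Eq. 1) and covariance (Eq. 2) of the base classes."""
    means, covs = {}, {}
    for cls, feats in base_features_by_class.items():
        feats = np.asarray(feats)                 # shape (n_i, d)
        means[cls] = feats.mean(axis=0)           # Eq. 1: mean over every dimension
        covs[cls] = np.cov(feats, rowvar=False)   # Eq. 2: unbiased covariance (divides by n_i - 1)
    return means, covs
```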

3.2.2 Calibrating statistics of the novel classes

Here, we consider an N-way-K-shot task sampled from the novel classes.

Tukey’s Ladder of Powers Transformation

To make the feature distribution more Gaussian-like, we first transform the features of the support set and query set in the target task using Tukey's Ladder of Powers transformation (Tukey, 1977). Tukey's Ladder of Powers transformation is a family of power transformations which can reduce the skewness of distributions and make them more Gaussian-like. It is formulated as:

$$\tilde{x} = \begin{cases} x^{\lambda} & \text{if } \lambda \neq 0 \\ \log(x) & \text{if } \lambda = 0, \end{cases} \qquad (3)$$

where $\lambda$ is a hyper-parameter that adjusts how the distribution is corrected. The original feature can be recovered by setting $\lambda$ to 1. Decreasing $\lambda$ makes the distribution less positively skewed and vice versa.
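In code, Equation 3 is essentially a one-liner; the sketch below assumes non-negative features (which the paper ensures by taking activations after a ReLU), since fractional powers and the logarithm are otherwise undefined.

```python
import numpy as np

def tukey_transform(x, lam=0.5):
    """Tukey's Ladder of Powers (Eq. 3): x**lam if lam != 0, else log(x)."""
    x = np.asarray(x, dtype=float)
    # log(0) is undefined; strictly positive features are assumed when lam == 0.
    return np.power(x, lam) if lam != 0 else np.log(x)
```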

Calibration through statistics transfer

Using the statistics from the base classes introduced in Section 3.2.1, we transfer the statistics, which are estimated more accurately on sufficient data, from the base classes to the novel classes. The transfer is based on the Euclidean distance between the feature space of the novel classes and the mean of the features from the base classes as computed in Equation 1. Specifically, we select the top $k$ base classes whose means are closest to the transformed feature $\tilde{x}$ of a sample from the support set:

$$\mathbb{S}_d = \{-\lVert \mu_i - \tilde{x} \rVert^2 \mid i \in C_b\}, \qquad (4)$$
$$\mathbb{S}_N = \{i \mid -\lVert \mu_i - \tilde{x} \rVert^2 \in \operatorname{topk}(\mathbb{S}_d)\}, \qquad (5)$$

where $\operatorname{topk}(\cdot)$ is an operator that selects the top $k$ elements from the input distance set $\mathbb{S}_d$, and $\mathbb{S}_N$ stores the $k$ nearest base classes with respect to a feature vector $\tilde{x}$. Then, the mean and covariance of the distribution are calibrated by the statistics from the nearest base classes:

$$\mu' = \frac{\sum_{i \in \mathbb{S}_N} \mu_i + \tilde{x}}{k + 1}, \qquad \Sigma' = \frac{\sum_{i \in \mathbb{S}_N} \Sigma_i}{k} + \alpha, \qquad (6)$$

where $\alpha$ is a hyper-parameter that determines the degree of dispersion of features sampled from the calibrated distribution.

For few-shot learning with more than one shot, the distribution calibration procedure described above is repeated, each time using one feature vector from the support set. This avoids the bias introduced by any single sample and potentially yields a more diverse and accurate distribution estimation. Thus, for simplicity of notation, we denote the calibrated distribution as a set of statistics. For a class $y \in C_n$, we denote the set of statistics as $\mathbb{S}_y = \{(\mu'_1, \Sigma'_1), \dots, (\mu'_K, \Sigma'_K)\}$, where $\mu'_i$, $\Sigma'_i$ are the calibrated mean and covariance, respectively, computed based on the $i$-th feature in the support set of class $y$. Here, the size of the set $\mathbb{S}_y$ equals $K$ for an N-way-K-shot task.
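Combining Equations 4–6, the calibration of a single Tukey-transformed support feature can be sketched as follows (our own helper, reusing the base-class statistics from the earlier sketch); for an N-way-K-shot task it would simply be called once per support feature of each class.

```python
import numpy as np

def calibrate(x_tilde, base_means, base_covs, k=2, alpha=0.21):
    """Calibrate mean and covariance for one transformed support feature (Eqs. 4-6)."""
    classes = list(base_means)
    # Eq. 4: squared Euclidean distance to every base class mean
    dists = np.array([np.sum((base_means[c] - x_tilde) ** 2) for c in classes])
    # Eq. 5: the k nearest base classes
    nearest = [classes[i] for i in np.argsort(dists)[:k]]
    # Eq. 6: average base means with the support feature; average covariances and add alpha
    mu = (sum(base_means[c] for c in nearest) + x_tilde) / (k + 1)
    sigma = sum(base_covs[c] for c in nearest) / k + alpha
    return mu, sigma
```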

3.2.3 How to leverage the calibrated distribution?

With a set of calibrated statistics $\mathbb{S}_y$ for class $y$ in a target task, we generate a set of feature vectors with label $y$ by sampling from the calibrated Gaussian distributions:

$$\mathbb{D}_y = \{(x, y) \mid x \sim \mathcal{N}(\mu, \Sigma),\ \forall (\mu, \Sigma) \in \mathbb{S}_y\}. \qquad (7)$$

Here, the total number of generated features per class is set as a hyperparameter, and they are equally distributed across the calibrated distributions in $\mathbb{S}_y$. The generated features $\mathbb{D}_y$, along with the original support set features of a few-shot task, then serve as the training data for a task-specific classifier. We train the classifier for a task $T$ by minimizing the cross-entropy loss over both the features of its support set $\tilde{\mathcal{S}}$ and the generated features $\mathbb{D}_y$:

$$\ell = \sum_{(x, y) \sim \tilde{\mathcal{S}} \cup \mathbb{D}_{y},\ y \in \mathcal{Y}_T} -\log \Pr(y \mid x; \theta), \qquad (8)$$

where $\mathcal{Y}_T$ is the set of classes for the task $T$, $\tilde{\mathcal{S}}$ denotes the support set with features transformed by Tukey's Ladder of Powers transformation, and the classifier model is parameterized by $\theta$.
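A hedged end-to-end sketch of Equations 7 and 8: sample from each calibrated Gaussian and fit an off-the-shelf scikit-learn logistic regression on the union of generated and support features. `calibrated_stats` is an assumed container holding the K calibrated (mean, covariance) pairs per class produced by the calibration sketch above, and `support_x` holds the Tukey-transformed support features.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_task_classifier(support_x, support_y, calibrated_stats,
                          n_generated=750, seed=0):
    """Sample features from the calibrated Gaussians (Eq. 7) and train a
    classifier on generated + support features (Eq. 8).

    calibrated_stats: dict mapping class label -> list of (mu, sigma) pairs,
    one pair per support feature of that class (an assumed container).
    """
    rng = np.random.default_rng(seed)
    gen_x, gen_y = [], []
    for label, stats in calibrated_stats.items():
        per_dist = n_generated // len(stats)   # split generations equally over the K distributions
        for mu, sigma in stats:
            gen_x.append(rng.multivariate_normal(mu, sigma, size=per_dist))
            gen_y += [label] * per_dist
    train_x = np.vstack([np.asarray(support_x)] + gen_x)
    train_y = np.concatenate([np.asarray(support_y), np.asarray(gen_y)])
    # Logistic regression with scikit-learn defaults minimizes the cross-entropy of Eq. 8.
    return LogisticRegression().fit(train_x, train_y)
```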

Methods miniImageNet CUB
5way1shot 5way5shot 5way1shot 5way5shot
Optimization-based
MAML (Finn et al. (2017))
Meta-SGD (Li et al. (2017))
LEO (Rusu et al. (2019)) - -
E3BM (Liu et al. (2020b)) - -
Metric-based
Matching Net (Vinyals et al. (2016))
Prototypical Net (Snell et al. (2017))
Baseline++ (Chen et al. (2019a))
Variational Few-shot(Zhang et al. (2019)) - -
Negative-Cosine(Liu et al. (2020a))
Generation-based
MetaGAN (Zhang et al. (2018)) - -
Delta-Encoder (Schwartz et al. (2018))
TriNet (Chen et al. (2019b))
Meta Variance Transfer (Park et al. (2020)) - -
Maximum Likelihood with DC (Ours)
SVM with DC (Ours)
Logistic Regression with DC (Ours)
Table 2: 5way1shot and 5way5shot classification accuracy (%) on miniImageNet and CUB with 95% confidence intervals. The numbers in bold have intersecting confidence intervals with the most accurate method.
Methods tieredImageNet
5way1shot 5way5shot
Matching Net (Vinyals et al. (2016))
Prototypical Net (Snell et al. (2017))
LEO (Rusu et al. (2019))
E3BM (Liu et al. (2020b))
DeepEMD (Zhang et al., 2020)
Maximum Likelihood with DC (Ours)
SVM with DC (Ours)
Logistic Regression with DC (Ours)
Table 3: 5way1shot and 5way5shot classification accuracy (%) on tieredImageNet (Ren et al., 2018). The numbers in bold have intersecting confidence intervals with the most accurate method.

4 Experiments

In this section, we answer the following questions:

  • How does our distribution calibration strategy perform compared to the state-of-the-art methods?

  • What does calibrated distribution look like? Is it an accurate approximation for this class?

  • How does Tukey’s Ladder of Powers transformation interact with the feature generation? How important is each to the performance?

4.1 Experimental Setup

4.1.1 Datasets

We evaluate our distribution calibration strategy on miniImageNet (Ravi and Larochelle, 2017), tieredImageNet (Ren et al., 2018) and CUB (Welinder et al., 2010). miniImageNet and tieredImageNet have a broad range of classes including various animals and objects, while CUB is a more fine-grained dataset that includes various species of birds. Datasets with different levels of granularity may have different distributions in their feature space. We want to show the effectiveness and generality of our strategy on all three datasets.

miniImageNet is derived from the ILSVRC-12 dataset (Russakovsky et al., 2014). It contains 100 diverse classes with 600 samples per class. The image size is 84 × 84. We follow the splits used in previous works (Ravi and Larochelle, 2017), which divide the dataset into 64 base classes, 16 validation classes, and 20 novel classes.

tieredImageNet is a larger subset of the ILSVRC-12 dataset (Russakovsky et al., 2014), which contains 608 classes sampled from a hierarchical category structure. Each class belongs to one of 34 higher-level categories sampled from the high-level nodes in ImageNet. The average number of images in each class is 1281. We use 351, 97, and 160 classes for training, validation, and test, respectively.

CUB is a fine-grained few-shot classification benchmark. It contains 200 different classes of birds with a total of 11,788 images of size 84 × 84. Following previous works (Chen et al., 2019a), we split the dataset into 100 base classes, 50 validation classes, and 50 novel classes.

4.1.2 Evaluation Metric

We use the top-1 accuracy as the evaluation metric to measure the performance of our method. We report the accuracy on the 5way1shot and 5way5shot settings for miniImageNet, tieredImageNet and CUB. The reported results are the averaged classification accuracy over 10,000 tasks.

4.1.3 Implementation Details

For the feature extractor, we use the WideResNet trained following previous work (Mangla et al., 2020). For each dataset, we train the feature extractor on the base classes and test the performance using the novel classes. Note that the feature representation is extracted from the penultimate layer (with a ReLU activation function) of the feature extractor, so the values are all non-negative and the inputs to Tukey's Ladder of Powers transformation in Equation 3 are valid. At the distribution calibration stage, we compute the base class statistics and transfer them to calibrate the novel class distribution for each dataset. We use the LR and SVM implementations of scikit-learn (Pedregosa et al., 2011) with the default settings. We use the same hyperparameter values for all datasets except for $\alpha$. Specifically, the number of generated features is 750, $k = 2$, and $\lambda = 0.5$. $\alpha$ is 0.21, 0.21 and 0.3 for miniImageNet, tieredImageNet and CUB, respectively. The source code is available at: https://github.com/ShuoYang-1998/ICLR2021-Oral_Distribution_Calibration
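For reference, the hyperparameters listed above can be collected in a single configuration; the snippet below is simply a restatement of the values in this section (our own snippet, not the authors' code), with the scikit-learn classifiers instantiated with their default settings as the paper describes.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Hyperparameters from Section 4.1.3; only alpha changes across datasets.
HPARAMS = {
    "n_generated_per_class": 750,   # features sampled per novel class
    "k": 2,                         # nearest base classes used in calibration
    "lambda": 0.5,                  # power for Tukey's transformation
    "alpha": {"miniImageNet": 0.21, "tieredImageNet": 0.21, "CUB": 0.3},
}

# Classifiers with scikit-learn default settings.
lr_classifier = LogisticRegression()
svm_classifier = SVC()
```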

4.2 Comparison to State-of-the-art

Table 2 and Table 3 present the 5way1shot and 5way5shot classification results of our method on miniImageNet, tieredImageNet and CUB. We compare our method with three groups of few-shot learning methods: optimization-based, metric-based, and generation-based. Our method can be built on top of any classifier, and we use two popular and simple classifiers, namely SVM and LR, to prove its effectiveness. Simple linear classifiers equipped with our method perform better than the state-of-the-art few-shot classification methods and achieve the best performance on the 1-shot and 5-shot settings of miniImageNet, tieredImageNet and CUB. The performance of our distribution calibration surpasses the state-of-the-art generation-based method by 10% for the 5way1shot setting, which proves that our method handles extremely low-shot classification tasks better. Compared to other generation-based methods, which require the design of a generative model with extra training costs on the learnable parameters, a simple machine learning classifier with DC is much simpler, more effective and more flexible, and can be equipped with any feature extractor and classifier model structure. Specifically, we show three variants, i.e., Maximum Likelihood with DC, SVM with DC, and Logistic Regression with DC, in Table 2 and Table 3. A simple maximum likelihood classifier based on the calibrated distribution can outperform previous baselines, and training an SVM classifier or Logistic Regression classifier using the samples from the calibrated distribution can further improve the performance.

Figure 2: t-SNE visualization of our distribution estimation. Different colors represent different classes. ‘’ represents support set features, ‘x’ in figure (d) represents query set features, ‘’ in figure (b)(c) represents generated features.
Figure 3: Left: Accuracy when increasing the power in Tukey's transformation when training with (red) or without (blue) the generated features. Right: Accuracy when increasing the number of generated features when the features are transformed by Tukey's transformation (red) and when they are not (blue).
Tukey transformation Training with generated features miniImageNet
5way1shot 5way5shot
Table 4: Ablation study on miniImageNet 5way1shot and 5way5shot showing accuracy (%) with 95% confidence intervals.

4.3 Visualization of Generated Samples

We show what the calibrated distribution looks like by visualizing the generated features sampled from it. In Figure 2, we show the t-SNE representation (van der Maaten and Hinton, 2008) of the original support set (a), the generated features (b, c), as well as the query set (d). Based on the calibrated distribution, the sampled features form a Gaussian distribution, and more samples (c) give a more comprehensive representation of the distribution. Due to the limited number of examples in the support set, only one in this case, the samples from the query set usually cover a greater area and are mismatched with the support set. This mismatch can be fixed to some extent by the generated features, i.e., the generated features in (c) can overlap areas of the query set. Thus, training with these generated features can alleviate the mismatch between the distribution estimated only from the few-shot samples and the ground-truth distribution.

4.4 Applicability of distribution calibration

Applying distribution calibration on different backbones

Our distribution calibration strategy is agnostic to the backbone / feature extractor. Table 5 shows a consistent performance boost when applying distribution calibration to different feature extractors, i.e., four convolutional layers (conv4), six convolutional layers (conv6), resnet18, WRN28, and WRN28 trained with rotation loss. Distribution calibration achieves around 10% accuracy improvement over each backbone trained with the corresponding baseline.

Backbones without DC with DC
conv4 (Chen et al., 2019a) ()
conv6 (Chen et al., 2019a) ()
resnet18 (Chen et al., 2019a) ()
WRN28 (Mangla et al., 2020) ()
WRN28 + Rotation Loss (Mangla et al., 2020) ()
Table 5: 5way1shot classification accuracy (%) on miniImageNet with different backbones.

Applying distribution calibration on other baselines

A variety of works can benefit from training with the features generated by our distribution calibration strategy. We apply our distribution calibration strategy to two simple few-shot classification algorithms, Baseline (Chen et al., 2019a) and Baseline++ (Chen et al., 2019a). Table 6 shows that our distribution calibration brings over 10% accuracy improvement to both.

Method without DC with DC
Baseline (Chen et al., 2019a) ()
Baseline++ (Chen et al., 2019a) ()
Table 6: 5way1shot classification accuracy (%) on miniImageNet with different baselines using distribution calibration.

4.5 Effects of feature transformation and training with generated features

Ablation Study

Table 4 shows the performance when our model is trained without Tukey’s Ladder of Powers transformation for the features as in Equation 3 and when it is trained without the generated features as in Equation 7. It is clear that there is a severe decline in performance of over 10% if both are not used in the 5way1shot setting. The ablation of either one results in a performance drop of around 5% in the 5way1shot setting.

Choices of Power for Tukey’s Ladder of Powers Transformation

The left side of Figure 3 shows the 5way1shot accuracy when choosing different powers for Tukey's transformation in Equation 3 when training the classifier with the generated features (red) and without (blue). Note that when the power equals 1, the transformation keeps the original feature representations. There is a consistent general tendency when training with and without the generated features, and in both cases we found $\lambda = 0.5$ to be the optimum choice. With Tukey's transformation, the distribution of query set features in target tasks becomes more aligned with the calibrated Gaussian distribution, which benefits the classifier trained on features sampled from the calibrated distribution.

Number of generated features

The right side of Figure 3 analyzes whether more generated features result in a consistent improvement in both cases, namely when the features of the support and query set are transformed by Tukey's transformation (red) and when they are not (blue). We found that when the number of generated features is below 500, both cases benefit from more generated features. However, when more features are sampled, the performance of the classifier tested on untransformed features begins to decline. By training with the generated samples, the simple logistic regression classifier obtains a 12% relative performance improvement in the 1-shot classification setting.

4.6 Other Hyper-parameters

We select the hyperparameters based on the performance on the validation set. The number k of base class statistics used to calibrate the novel class distribution in Equation 5 is set to 2. Figure 4 shows the effect of different values of k. The $\alpha$ in Equation 6 is a constant added to each element of the estimated covariance matrix, which determines the degree of dispersion of features sampled from the calibrated distributions. An appropriate value of $\alpha$ can ensure a good decision boundary for the classifier. Different datasets have different statistics, and the appropriate value of $\alpha$ may vary across datasets. Figure 5 explores the effect of $\alpha$ on all three datasets, i.e., miniImageNet, tieredImageNet and CUB. We observe that in each dataset, the performance on the validation set and the novel (testing) set generally follows the same tendency, which indicates that the choice of $\alpha$ is dataset-dependent rather than overfitted to a specific set.

Figure 4: The effect of different values of k.
Figure 5: The effect of different values of $\alpha$.

5 Conclusion and future works

We propose a simple but effective distribution calibration strategy for few-shot classification. Without complex generative models, extra training losses, or extra parameters to learn, a simple logistic regression trained with features generated by our strategy outperforms the current state-of-the-art methods by ~5% on miniImageNet. The calibrated distribution is visualized and demonstrates an accurate estimation of the feature distribution. Future works will explore the applicability of distribution calibration to more problem settings, such as multi-domain few-shot classification, and more methods, such as metric-based meta-learning algorithms.

References

  • A. Antoniou and A. J. Storkey (2019) Assume, augment and learn: unsupervised few-shot meta-learning via random labels and data augmentation. CoRR. Cited by: §2.
  • W. Chen, Y. Liu, Z. Kira, Y. F. Wang, and J. Huang (2019a) A closer look at few-shot classification. In ICLR, Cited by: Table 2, §4.1.1, §4.4, Table 5, Table 6.
  • Z. Chen, Y. Fu, Y. Zhang, Y. Jiang, X. Xue, and L. Sigal (2019b) Multi-level semantic feature augmentation for one-shot learning. TIP 28 (9), pp. 4594–4605. Cited by: §2, Table 2.
  • C. Finn, P. Abbeel, and S. Levine (2017) Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, Cited by: §1, §2, Table 2.
  • H. Gao, Z. Shou, A. Zareian, H. Zhang, and S. Chang (2018) Low-shot learning via covariance-preserving adversarial augmentation networks. In NeurIPS, Cited by: §2.
  • I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In NeurIPS, Cited by: §2.
  • B. Hariharan and R. Girshick (2017) Low-shot visual recognition by shrinking and hallucinating features. In ICCV, Cited by: §1, §2.
  • Z. Li, F. Zhou, F. Chen, and H. Li (2017) Meta-sgd: learning to learn quickly for few shot learning. CoRR. External Links: 1707.09835 Cited by: §2, Table 2.
  • B. Liu, Y. Cao, Y. Lin, Q. Li, Z. Zhang, M. Long, and H. Hu (2020a) Negative margin matters: understanding margin in few-shot classification. In ECCV, Cited by: Table 2.
  • J. Liu, Y. Sun, C. Han, Z. Dou, and W. Li (2020) Deep representation learning on long-tailed data: a learnable embedding augmentation perspective. In CVPR, Cited by: §2.
  • L. Liu, T. Zhou, G. Long, J. Jiang, L. Yao, and C. Zhang (2019a) Prototype propagation networks (PPN) for weakly-supervised few-shot learning on category graph. In IJCAI, Cited by: §2.
  • L. Liu, T. Zhou, G. Long, J. Jiang, and C. Zhang (2019b) Learning to propagate for graph meta-learning. In NeurIPS, Cited by: §2.
  • Y. Liu, B. Schiele, and Q. Sun (2020b) An ensemble of epoch-wise empirical bayes for few-shot learning. In ECCV, Cited by: Table 2, Table 3.
  • P. Mangla, N. Kumari, A. Sinha, M. Singh, B. Krishnamurthy, and V. N. Balasubramanian (2020) Charting the right manifold: manifold mixup for few-shot learning. In WACV, Cited by: §3.2, §4.1.3, Table 5.
  • S. Park, S. Han, J. Baek, I. Kim, J. Song, H. B. Lee, J. Han, and S. J. Hwang (2020) Meta variance transfer: learning to augment from the others. In ICML, Cited by: §2, Table 2.
  • F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay (2011) Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12, pp. 2825–2830. Cited by: §4.1.3.
  • T. Qin, W. Li, Y. Shi, and Y. Gao (2020) Diversity helps: unsupervised few-shot learning via distribution shift-based data augmentation. External Links: 2004.05805 Cited by: §2.
  • S. Ravi and H. Larochelle (2017) Optimization as a model for few-shot learning. In ICLR, Cited by: §4.1.1, §4.1.1.
  • M. Ren, E. Triantafillou, S. Ravi, J. Snell, K. Swersky, J. B. Tenenbaum, H. Larochelle, and R. S. Zemel (2018) Meta-learning for semi-supervised few-shot classification. In ICLR, Cited by: §1, Table 3, §4.1.1.
  • D. E. Rumelhart, G. E. Hinton, and R. J. Williams (1986) Learning Representations by Back-propagating Errors. Nature 323, pp. 533–536. Cited by: §2.
  • O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. S. Bernstein, A. C. Berg, and F. Li (2014) ImageNet large scale visual recognition challenge. CoRR abs/1409.0575. Cited by: §4.1.1, §4.1.1.
  • A. A. Rusu, D. Rao, J. Sygnowski, O. Vinyals, R. Pascanu, S. Osindero, and R. Hadsell (2019) Meta-learning with latent embedding optimization. In ICLR, Cited by: Table 2, Table 3.
  • R. Salakhutdinov, J. Tenenbaum, and A. Torralba (2012) One-shot learning with a hierarchical nonparametric bayesian model. In ICML workshop, Cited by: §1.
  • E. Schwartz, L. Karlinsky, J. Shtok, S. Harary, M. Marder, A. Kumar, R. Feris, R. Giryes, and A. Bronstein (2018) Delta-encoder: an effective sample synthesis method for few-shot object recognition. In NeurIPS, Cited by: §2, Table 2.
  • J. Snell, K. Swersky, and R. S. Zemel (2017) Prototypical networks for few-shot learning. In NeurIPS, Cited by: §1, §2, Table 2, Table 3.
  • J. W. Tukey (1977) Exploratory data analysis. Addison-Wesley Series in Behavioral Science, Addison-Wesley, Reading, MA. External Links: Link Cited by: §3.2.2.
  • L. van der Maaten and G. Hinton (2008) Visualizing data using t-SNE. Journal of Machine Learning Research. Cited by: §4.3.
  • O. Vinyals, C. Blundell, T. Lillicrap, K. Kavukcuoglu, and D. Wierstra (2016) Matching networks for one shot learning. In NeurIPS, Cited by: §2, §3.1, Table 2, Table 3.
  • Y. Wang, R. Girshick, M. Hebert, and B. Hariharan (2018) Low-shot learning from imaginary data. In CVPR, Cited by: §1, §2.
  • Y. Wang, X. Pan, S. Song, H. Zhang, G. Huang, and C. Wu (2019) Implicit semantic data augmentation for deep networks. In NeurIPS, Cited by: §2.
  • P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, and P. Perona (2010) Caltech-UCSD Birds 200. Technical report Technical Report CNS-TR-2010-001, California Institute of Technology. Cited by: §4.1.1.
  • Y. Xian, T. Lorenz, B. Schiele, and Z. Akata (2018) Feature generating networks for zero-shot learning. In CVPR, Cited by: §1, §2.
  • C. Zhang, Y. Cai, G. Lin, and C. Shen (2020) DeepEMD: few-shot image classification with differentiable earth mover’s distance and structured classifiers. In CVPR, Cited by: Table 3.
  • J. Zhang, C. Zhao, B. Ni, M. Xu, and X. Yang (2019) Variational few-shot learning. In ICCV, Cited by: §2, Table 2.
  • R. Zhang, T. Che, Z. Ghahramani, Y. Bengio, and Y. Song (2018) MetaGAN: an adversarial approach to few-shot learning. In NeurIPS, Cited by: §2, Table 2.

Appendix A Augmentation with nearest class features

Instead of sampling from the calibrated distribution, we can simply retrieve examples from the nearest class to augment the support set. Table 7 shows the comparison between training using samples from the calibrated distribution, different numbers of retrieved features from the nearest class, and only using the support set. We found that the retrieved features can improve the performance compared to only using the support set, but can damage the performance when the number of retrieved features increases, since the retrieved samples probably serve as noisy data for tasks targeting different classes.

Training data miniImageNet 5way1shot
Support set only
Support set + 1 feature from the nearest class
Support set + 5 features from the nearest class
Support set + 10 features from the nearest class
Support set + 100 features from the nearest class
Support set + 100 features sampled from calibrated distribution
Table 7: The comparison with nearest class feature augmentation.

Appendix B Distribution Calibration without novel feature

We calibrate the novel class mean by averaging the novel class feature and the retrieved base class means in Equation 6. Table 8 shows distribution calibration without averaging in the novel feature, in which the calibrated mean is calculated as $\mu' = \frac{\sum_{i \in \mathbb{S}_N} \mu_i}{k}$.

miniImageNet 5way1shot
Distribution Calibration w/o novel feature
Distribution Calibration w/ novel feature
Table 8: The comparison between distribution calibration with and without the novel feature $\tilde{x}$.

Appendix C The effects of Tukey’s transformation

Figure 6 shows the distribution of 5 base classes and 5 novel classes before/after Tukey's transformation. It can be observed that the base class distribution satisfies the Gaussian assumption well (left), while the novel class distribution is more skewed (middle). The novel class distribution after Tukey's transformation (right) is more aligned with the Gaussian-like base class distribution.

Appendix D The similarity level analysis

We found that the higher the similarity between the retrieved base class distributions and the novel class ground-truth distribution, the greater the performance improvement our method brings, as shown in Table 9. The results in the table are under the 5-way-1-shot setting.

Novel class Top-1 base class similarity Top-2 base class similarity DC improvement
malamute 93% 85% 21.30%
golden retriever 85% 74% 18.37%
ant 71% 67% 9.77%
Table 9: Performance improvement with respect to the similarity level between a query novel class and the most similar base classes.