1 Introduction
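Table 1: Similarity of the feature mean (mean sim) and variance (var sim) between Arctic fox and other classes.
Arctic fox  mean sim  var sim
white wolf  97%  97%
malamute  85%  78%
lion  81%  70%
meerkat  78%  70%
jellyfish  46%  26%
orange  40%  19%
beer bottle  34%  11%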
Learning from a limited number of training samples has drawn increasing attention due to the high cost of collecting and annotating a large amount of data. Researchers have developed algorithms to improve the performance of models that have been trained with very little data. Finn et al. (2017); Snell et al. (2017) train models in a meta-learning fashion so that the model can adapt quickly on tasks with only a few training samples available. Hariharan and Girshick (2017); Wang et al. (2018) try to synthesize data or features by learning a generative model to alleviate the data insufficiency problem. Ren et al. (2018) propose to leverage unlabeled data and predict pseudo labels to improve the performance of few-shot learning.
While most previous works focus on developing stronger models, scant attention has been paid to the properties of the data itself. It is natural that when the amount of data grows, the ground truth distribution can be more accurately uncovered. Models trained with a wide coverage of data can generalize well during evaluation. On the other hand, when a model is trained with only a few samples, it tends to overfit to them by minimizing the training loss over these samples alone. These phenomena are illustrated in Figure 1. The biased distribution based on a few examples can damage the generalization ability of the model, since it is far from mirroring the ground truth distribution from which test cases are sampled during evaluation.
Here, we consider calibrating this biased distribution into a more accurate approximation of the ground truth distribution. In this way, a model trained with inputs sampled from the calibrated distribution can generalize over a broader range of data from a more accurate distribution rather than only fitting itself to those few samples. Instead of calibrating the distribution of the original data space, we try to calibrate the distribution in the feature space, which has much lower dimensions and is easier to calibrate (Xian et al. (2018)). We assume every dimension in the feature vectors follows a Gaussian distribution and observe that similar classes usually have similar mean and variance of the feature representations, as shown in Table 1. Thus, the mean and variance of the Gaussian distribution can be transferred across similar classes (Salakhutdinov et al. (2012)). Meanwhile, the statistics can be estimated more accurately when there are adequate samples for a class. Based on these observations, we reuse the statistics from many-shot classes and transfer them to better estimate the distribution of the few-shot classes according to their class similarity. More samples can then be generated according to the estimated distribution, which provides sufficient supervision for training the classification model.
In the experiments, we show that a simple logistic regression classifier trained with our strategy can achieve state-of-the-art accuracy on two datasets. Our distribution calibration strategy can be paired with any classifier and feature extractor with no extra learnable parameters. Training with samples selected from the calibrated distribution can achieve a 12% accuracy gain in a 5-way-1-shot task compared to the baseline, which is trained only with the few samples given. We also visualize the calibrated distribution and show that it is an accurate approximation of the ground truth that can better cover the test cases.
2 Related Works
Few-shot classification is a challenging machine learning problem, and researchers have explored the idea of learning to learn, or meta-learning, to improve the quick adaptation ability and alleviate the few-shot challenge. One of the most general families of meta-learning algorithms is the optimization-based one. Finn et al. (2017) and Li et al. (2017) proposed to learn how to optimize the gradient descent procedure so that the learner can have a good initialization, update direction, and learning rate. For the classification problem, researchers proposed simple but effective algorithms based on metric learning. MatchingNet (Vinyals et al., 2016) and ProtoNet (Snell et al., 2017) learned to classify samples by comparing the distance to the representatives of each class. Our distribution calibration and feature sampling procedure does not include any learnable parameters, and the classifier is trained in a traditional supervised learning way.
Another line of algorithms compensates for the insufficient number of available samples by generation. Most methods use the idea of Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) or autoencoders (Rumelhart et al., 1986) to generate samples (Zhang et al. (2018); Chen et al. (2019b); Schwartz et al. (2018); Gao et al. (2018)) or features (Xian et al. (2018); Zhang et al. (2019)) to augment the training set. Specifically, Zhang et al. (2018) and Xian et al. (2018) proposed to synthesize data by introducing an adversarial generator conditioned on tasks. Zhang et al. (2019) tried to learn a variational autoencoder to approximate the distribution and predict labels based on the estimated statistics. The autoencoder can also augment samples by projecting between the visual space and the semantic space (Chen et al., 2019b) or encoding the intra-class deformations (Schwartz et al., 2018). Liu et al. (2019b) and Liu et al. (2019a) propose to generate features through the class hierarchy. While these methods can generate extra samples or features for training, they require the design of a complex model and loss function to learn how to generate. In contrast, our distribution calibration strategy is simple and does not need extra learnable parameters.
Data augmentation is a traditional and effective way of increasing the number of training samples. Qin et al. (2020) and Antoniou and Storkey (2019) proposed the use of traditional data augmentation techniques to construct pretext tasks for unsupervised few-shot learning. Wang et al. (2018) and Hariharan and Girshick (2017) leveraged the general idea of data augmentation: they designed a hallucination model to generate the augmented version of the image with different choices for the model's input, i.e., an image and a noise (Wang et al., 2018) or the concatenation of multiple features (Hariharan and Girshick, 2017). Park et al. (2020); Wang et al. (2019); Liu et al. (2020) tried to augment feature representations by leveraging intra-class variance. These methods learn to augment from the original samples or their feature representations, whereas we estimate the class-level distribution and can thus eliminate the inductive bias from a single sample and provide more diverse generations from the calibrated distribution.
3 Main Approach
In this section, we introduce the few-shot classification problem definition in Section 3.1 and details of our proposed approach in Section 3.2.
3.1 Problem Definition
We follow a typical few-shot classification setting. We are given a dataset of data-label pairs $\mathcal{D} = \{(x_i, y_i)\}$, where $x_i \in \mathbb{R}^d$ is the feature vector of a sample and $y_i \in C$, where $C$ denotes the set of classes. This set of classes is divided into base classes $C_b$ and novel classes $C_n$, where $C_b \cap C_n = \emptyset$ and $C_b \cup C_n = C$. The goal is to train a model on the data from the base classes so that the model can generalize well on tasks sampled from the novel classes. In order to evaluate the fast adaptation ability or the generalization ability of the model, there are only a few available labeled samples for each task $\mathcal{T}$. The most common way to build a task is called an N-way-K-shot task (Vinyals et al. (2016)), where N classes are sampled from the novel set and only K (e.g., 1 or 5) labeled samples are provided for each class. The few available labeled data are called the support set $\mathcal{S} = \{(x_i, y_i)\}_{i=1}^{N \times K}$, and the model is evaluated on another query set $\mathcal{Q} = \{(x_i, y_i)\}_{i=N \times K + 1}^{N \times K + N \times q}$, where every class in the task has $q$ test cases. Thus, the performance of a model is evaluated as the averaged accuracy on (the query set of) multiple tasks sampled from the novel classes.
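As an illustration of the task construction described above, the following is a minimal NumPy sketch of sampling one N-way-K-shot task from pre-extracted features. The function name `sample_task`, its argument names, and the default of 15 query samples per class are assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def sample_task(features, labels, novel_classes, n_way=5, k_shot=1, q_query=15, rng=None):
    """Sample one N-way-K-shot task (support and query sets) from the novel classes.

    features: (num_samples, d) array of feature vectors.
    labels:   (num_samples,) array of integer class ids.
    q_query:  query samples per class (the default of 15 is an assumption).
    """
    rng = np.random.default_rng() if rng is None else rng
    # Pick N distinct novel classes for this task.
    task_classes = rng.choice(novel_classes, size=n_way, replace=False)
    support_x, support_y, query_x, query_y = [], [], [], []
    for c in task_classes:
        # Shuffle the indices of class c, then split them into support and query.
        idx = rng.permutation(np.flatnonzero(labels == c))
        s_idx = idx[:k_shot]
        q_idx = idx[k_shot:k_shot + q_query]
        support_x.append(features[s_idx])
        support_y.append(labels[s_idx])
        query_x.append(features[q_idx])
        query_y.append(labels[q_idx])
    return (np.concatenate(support_x), np.concatenate(support_y),
            np.concatenate(query_x), np.concatenate(query_y))
```

Averaging the query-set accuracy over many such sampled tasks gives the evaluation protocol described in this section.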
3.2 Distribution Calibration
As introduced in Section 3.1, the base classes have a sufficient amount of data while the evaluation tasks sampled from the novel classes only have a limited number of labeled samples. The statistics of the distribution for a base class can be estimated more accurately than an estimate based on few-shot samples, which is an ill-posed problem. As shown in Table 1, we observe that if we assume the feature distribution is Gaussian, the mean and variance with respect to each class are correlated with the semantic similarity of the classes. With this in mind, the statistics can be transferred from the base classes to the novel classes if we learn how similar the two classes are. In the following sections, we discuss how we calibrate the distribution estimation of the classes with only a few samples (Section 3.2.2) with the help of the statistics of the base classes (Section 3.2.1). We also elaborate on how we leverage the calibrated distribution to improve the performance of few-shot learning (Section 3.2.3).
Note that our distribution calibration strategy operates at the feature level and is agnostic to the feature extractor. Thus, it can be built on top of any pretrained feature extractor without further costly fine-tuning. In our experiments, we use the pretrained WideResNet following previous work (Mangla et al. (2020)). The WideResNet is trained to classify the base classes, along with a self-supervised pretext task to learn general-purpose representations suitable for image understanding tasks. Please refer to their paper for more details on training the feature extractor.
3.2.1 Statistics of the base classes
We assume the feature distribution of the base classes is Gaussian. The mean of the feature vectors from a base class $i$ is calculated as the mean of every single dimension in the vector:
$$\mu_i = \frac{\sum_{j=1}^{n_i} x_j}{n_i} \tag{1}$$
where $x_j$ is a feature vector of the $j$-th sample from the base class $i$ and $n_i$ is the total number of samples in class $i$. As the feature vector $x_j$ is multi-dimensional, we use covariance for a better representation of the variance between any pair of elements in the feature vector. The covariance matrix $\Sigma_i$ for the features from class $i$ is calculated as:
$$\Sigma_i = \frac{1}{n_i - 1} \sum_{j=1}^{n_i} (x_j - \mu_i)(x_j - \mu_i)^T \tag{2}$$
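The per-class mean and covariance described in this subsection can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the authors' implementation; the function name `base_class_statistics` and the dictionary return type are assumptions. It assumes the covariance is the usual unbiased sample covariance, which is what `np.cov` computes by default.

```python
import numpy as np

def base_class_statistics(features, labels):
    """Return {class_id: (mean, covariance)} for every class in `labels`.

    features: (num_samples, d) array of feature vectors.
    labels:   (num_samples,) array of integer class ids.
    """
    stats = {}
    for c in np.unique(labels):
        x = features[labels == c]            # all samples of class c, shape (n_c, d)
        mean = x.mean(axis=0)                # per-dimension mean
        # rowvar=False: each row is one sample; np.cov divides by (n_c - 1) by default.
        cov = np.cov(x, rowvar=False)
        stats[int(c)] = (mean, cov)
    return stats
```

Because the base classes have many samples, these statistics only need to be computed once per dataset and can be reused for every few-shot task.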
3.2.2 Calibrating statistics of the novel classes
Here, we consider an N-way-K-shot task sampled from the novel classes.
Tukey's Ladder of Powers Transformation
To make the feature distribution more Gaussian-like, we first transform the features of the support set and query set in the target task using Tukey's Ladder of Powers transformation (Tukey (1977)). Tukey's Ladder of Powers transformation is a family of power transformations which can reduce the skewness of distributions and make distributions more Gaussian-like. It is formulated as:
$$\hat{x} = \begin{cases} x^{\lambda} & \text{if } \lambda \neq 0 \\ \log(x) & \text{if } \lambda = 0 \end{cases} \tag{3}$$
where $\lambda$ is a hyperparameter that adjusts how the distribution is corrected. The original feature is recovered by setting $\lambda$ to 1. Decreasing $\lambda$ makes the distribution less positively skewed, and vice versa.
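A minimal sketch of this transformation, assuming the standard form of Tukey's ladder (an element-wise power for a non-zero exponent and the logarithm for a zero exponent). The function name `tukey_transform` and the parameter name `lam` are chosen for this example. The inputs are assumed to be non-negative, as is the case for features taken after a ReLU.

```python
import numpy as np

def tukey_transform(x, lam):
    """Apply Tukey's Ladder of Powers transformation element-wise.

    x:   array of non-negative feature values.
    lam: the power; lam == 1 leaves the features unchanged.
    """
    x = np.asarray(x, dtype=float)
    if lam == 0:
        # Log branch; note that exact zeros in x would map to -inf here.
        return np.log(x)
    return np.power(x, lam)
```

For example, `tukey_transform([4.0, 9.0], 0.5)` takes the element-wise square root, which compresses large values more than small ones and so reduces positive skew.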
Calibration through statistics transfer
Using the statistics from the base classes introduced in Section 3.2.1, we transfer the statistics from the base classes, which are estimated more accurately on sufficient data, to the novel classes. The transfer is based on the Euclidean distance between the feature space of the novel classes and the mean of the features from the base classes $\mu_i$ as computed in Equation 1. Specifically, we select the top $k$ base classes with the closest distance to the feature $\tilde{x}$ of a sample from the support set:
$$\mathbb{S}_d = \{ -\|\mu_i - \tilde{x}\|^2 \mid i \in C_b \} \tag{4}$$
$$\mathbb{S}_N = \{ i \mid -\|\mu_i - \tilde{x}\|^2 \in topk(\mathbb{S}_d) \} \tag{5}$$
where $topk(\cdot)$ is an operator to select the top $k$ elements from the input distance set $\mathbb{S}_d$. $\mathbb{S}_N$ stores the $k$ nearest base classes with respect to a feature vector $\tilde{x}$. Then, the mean and covariance of the distribution are calibrated by the statistics from the nearest base classes:
$$\mu' = \frac{\sum_{i \in \mathbb{S}_N} \mu_i + \tilde{x}}{k + 1}, \quad \Sigma' = \frac{\sum_{i \in \mathbb{S}_N} \Sigma_i}{k} + \alpha \tag{6}$$
where $\alpha$ is a hyperparameter that determines the degree of dispersion of features sampled from the calibrated distribution.
For few-shot learning with more than one shot, the aforementioned distribution calibration procedure should be undertaken multiple times, each time using one feature vector from the support set. This avoids the bias introduced by one specific sample and potentially achieves a more diverse and accurate distribution estimation. Thus, for simplicity, we denote the calibrated distribution as a set of statistics. For a class $y \in C_n$, we denote the set of statistics as $\mathbb{S}_y = \{(\mu'_1, \Sigma'_1), \ldots, (\mu'_K, \Sigma'_K)\}$, where $\mu'_i$, $\Sigma'_i$ are the calibrated mean and covariance, respectively, computed based on the $i$-th feature in the support set of class $y$. Here, the size of the set is the value of $K$ for an N-way-K-shot task.
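The calibration step for a single support feature can be sketched as follows. This is a hedged illustration, not the authors' code: the function name `calibrate` and the array layout are assumptions. It assumes the calibrated mean averages the k nearest base-class means together with the support feature, and that the calibrated covariance averages the k nearest base-class covariances and then adds the dispersion constant to every element. The defaults use the values reported in this paper for miniImageNet (two nearest classes, dispersion constant 0.21).

```python
import numpy as np

def calibrate(x_tilde, base_means, base_covs, k=2, alpha=0.21):
    """Calibrate the distribution around one (Tukey-transformed) support feature.

    x_tilde:    (d,) support feature vector.
    base_means: (B, d) array, one mean per base class.
    base_covs:  (B, d, d) array, one covariance matrix per base class.
    Returns the calibrated (mean, covariance).
    """
    # Euclidean distance from the support feature to every base-class mean.
    # Ranking by distance is equivalent to ranking by squared distance.
    dist = np.linalg.norm(base_means - x_tilde, axis=1)
    nearest = np.argsort(dist)[:k]                     # indices of the k closest base classes
    # Average the k nearest means together with the support feature itself.
    mean = (base_means[nearest].sum(axis=0) + x_tilde) / (k + 1)
    # Average the k nearest covariances, then add alpha to every element.
    cov = base_covs[nearest].mean(axis=0) + alpha
    return mean, cov
```

For a K-shot task, calling this once per support feature of a class yields the K pairs of calibrated statistics for that class.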
3.2.3 How to leverage the calibrated distribution?
With a set of calibrated statistics $\mathbb{S}_y$ for class $y$ in a target task, we generate a set of feature vectors with label $y$ by sampling from the calibrated Gaussian distributions:
$$\mathbb{D}_y = \{ (x, y) \mid x \sim \mathcal{N}(\mu, \Sigma), \forall (\mu, \Sigma) \in \mathbb{S}_y \} \tag{7}$$
Here, the total number of generated features per class is set as a hyperparameter, and they are equally distributed over the calibrated distributions in $\mathbb{S}_y$. The generated features, along with the original support set features for a few-shot task, then serve as the training data for a task-specific classifier. We train the classifier for a task by minimizing the cross-entropy loss over both the features of its support set $\tilde{\mathcal{S}}$ and the generated features $\mathbb{D}_y$:
$$\ell = \sum_{(x, y) \sim \tilde{\mathcal{S}} \cup \mathbb{D}_{y, y \in \mathcal{Y}^{\mathcal{T}}}} -\log \Pr(y \mid x; \theta) \tag{8}$$
where $\mathcal{Y}^{\mathcal{T}}$ is the set of classes for the task $\mathcal{T}$. $\tilde{\mathcal{S}}$ denotes the support set with features transformed by Tukey's Ladder of Powers transformation and the classifier model is parameterized by $\theta$.
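The sampling step can be sketched with NumPy's multivariate normal sampler. The function name `generate_features` and its arguments are assumptions for this example. The default of 750 generated features per class is the value reported in the implementation details, and the budget is split equally across the calibrated distributions of a class as described above.

```python
import numpy as np

def generate_features(calibrated_stats, label, total=750, rng=None):
    """Sample labelled features from the calibrated Gaussians of one class.

    calibrated_stats: list of (mean, covariance) pairs, one per support feature.
    label:            class id assigned to every generated feature.
    total:            total number of features to generate for this class.
    """
    rng = np.random.default_rng() if rng is None else rng
    per_dist = total // len(calibrated_stats)          # equal share per distribution
    samples = [rng.multivariate_normal(mean, cov, size=per_dist)
               for mean, cov in calibrated_stats]
    x = np.concatenate(samples, axis=0)
    y = np.full(len(x), label)
    return x, y
```

The generated features for all classes in a task would then be concatenated with the transformed support features and passed to any off-the-shelf classifier, for instance a logistic regression or SVM as used in the experiments.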
Table 2: 5-way-1-shot and 5-way-5-shot classification results on miniImageNet and CUB.
Methods  miniImageNet 5-way-1-shot  miniImageNet 5-way-5-shot  CUB 5-way-1-shot  CUB 5-way-5-shot
Optimization-based
MAML (Finn et al. (2017))
Meta-SGD (Li et al. (2017))
LEO (Rusu et al. (2019))
E3BM (Liu et al. (2020b))
Metric-based
Matching Net (Vinyals et al. (2016))
Prototypical Net (Snell et al. (2017))
Baseline++ (Chen et al. (2019a))
Variational Few-shot (Zhang et al. (2019))
Negative-Cosine (Liu et al. (2020a))
Generation-based
MetaGAN (Zhang et al. (2018))
Delta-Encoder (Schwartz et al. (2018))
TriNet (Chen et al. (2019b))
Meta Variance Transfer (Park et al. (2020))
Maximum Likelihood with DC (Ours)
SVM with DC (Ours)
Logistic Regression with DC (Ours)
Table 3: 5-way-1-shot and 5-way-5-shot classification results on tieredImageNet.
Methods  tieredImageNet 5-way-1-shot  tieredImageNet 5-way-5-shot
Matching Net (Vinyals et al. (2016))
Prototypical Net (Snell et al. (2017))
LEO (Rusu et al. (2019))
E3BM (Liu et al. (2020b))
DeepEMD (Zhang et al., 2020)
Maximum Likelihood with DC (Ours)
SVM with DC (Ours)
Logistic Regression with DC (Ours)
4 Experiments
In this section, we answer the following questions:

How does our distribution calibration strategy perform compared to the state-of-the-art methods?

What does the calibrated distribution look like? Is it an accurate approximation for this class?

How does Tukey's Ladder of Powers transformation interact with the feature generation? How important is each in relation to performance?
4.1 Experimental Setup
4.1.1 Datasets
We evaluate our distribution calibration strategy on miniImageNet (Ravi and Larochelle (2017)), tieredImageNet (Ren et al. (2018)) and CUB (Welinder et al. (2010)). miniImageNet and tieredImageNet have a broad range of classes including various animals and objects, while CUB is a more fine-grained dataset that includes various species of birds. Datasets with different levels of granularity may have different distributions for their feature space. We want to show the effectiveness and generality of our strategy on all three datasets.
miniImageNet is derived from the ILSVRC-12 dataset (Russakovsky et al., 2014). It contains 100 diverse classes with 600 samples per class. The image size is $84 \times 84 \times 3$. We follow the splits used in previous works (Ravi and Larochelle, 2017), which split the dataset into 64 base classes, 16 validation classes, and 20 novel classes.
tieredImageNet is a larger subset of the ILSVRC-12 dataset (Russakovsky et al., 2014), which contains 608 classes sampled from a hierarchical category structure. Each class belongs to one of 34 higher-level categories sampled from the high-level nodes in ImageNet. The average number of images in each class is 1281. We use 351, 97, and 160 classes for training, validation, and test, respectively.
CUB is a fine-grained few-shot classification benchmark. It contains 200 different classes of birds with a total of 11,788 images of size $84 \times 84 \times 3$. Following previous works (Chen et al., 2019a), we split the dataset into 100 base classes, 50 validation classes, and 50 novel classes.
4.1.2 Evaluation Metric
We use the top-1 accuracy as the evaluation metric to measure the performance of our method. We report the accuracy on 5-way-1-shot and 5-way-5-shot settings for miniImageNet, tieredImageNet and CUB. The reported results are the averaged classification accuracy over 10,000 tasks.
4.1.3 Implementation Details
For the feature extractor, we use the WideResNet trained following previous work (Mangla et al. (2020)). For each dataset, we train the feature extractor with the base classes and test the performance using the novel classes. Note that the feature representation is extracted from the penultimate layer (with a ReLU activation function) of the feature extractor, so the values are all non-negative and the inputs to Tukey's Ladder of Powers transformation in Equation 3 are valid. At the distribution calibration stage, we compute the base class statistics and transfer them to calibrate the novel class distribution for each dataset. We use the LR and SVM implementations of scikit-learn (Pedregosa et al. (2011)) with the default settings. We use the same hyperparameter values for all datasets except for $\alpha$. Specifically, the number of generated features is 750; $k = 2$ and $\lambda = 0.5$. $\alpha$ is 0.21, 0.21 and 0.3 for miniImageNet, tieredImageNet and CUB, respectively. The source code is available at: https://github.com/ShuoYang1998/ICLR2021Oral_Distribution_Calibration
4.2 Comparison to State-of-the-art
Table 2 and Table 3 present the 5-way-1-shot and 5-way-5-shot classification results of our method on miniImageNet, tieredImageNet and CUB. We compare our method with three groups of few-shot learning methods: optimization-based, metric-based, and generation-based. Our method can be built on top of any classifier, and we use two popular and simple classifiers, namely SVM and LR, to demonstrate its effectiveness. Simple linear classifiers equipped with our method perform better than the state-of-the-art few-shot classification methods and achieve the best performance on the 1-shot and 5-shot settings of miniImageNet, tieredImageNet and CUB. Our distribution calibration surpasses the state-of-the-art generation-based method by 10% in the 5-way-1-shot setting, which indicates that our method handles extremely low-shot classification tasks better. Compared to other generation-based methods, which require the design of a generative model with extra training costs for the learnable parameters, a simple machine learning classifier with DC is simpler, more effective and more flexible, and can be paired with any feature extractor and classifier model structure. Specifically, we show three variants, i.e., Maximum Likelihood with DC, SVM with DC, and Logistic Regression with DC, in Table 2 and Table 3. A simple maximum likelihood classifier based on the calibrated distribution can outperform previous baselines, and training an SVM classifier or a Logistic Regression classifier using samples from the calibrated distribution can further improve the performance.
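Table 4: Ablation of Tukey's transformation and training with generated features on miniImageNet.
Tukey transformation  Training with generated features  miniImageNet 5-way-1-shot  miniImageNet 5-way-5-shot
✗  ✗
✓  ✗
✗  ✓
✓  ✓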
4.3 Visualization of Generated Samples
We show what the calibrated distribution looks like by visualizing the generated features sampled from the distribution. In Figure 2, we show the t-SNE representation (van der Maaten and Hinton (2008)) of the original support set (a), the generated features (b, c) as well as the query set (d). Based on the calibrated distribution, the sampled features form a Gaussian distribution, and more samples (c) give a more comprehensive representation of the distribution. Due to the limited number of examples in the support set, only 1 in this case, the samples from the query set usually cover a greater area and are mismatched with the support set. This mismatch can be fixed to some extent by the generated features, i.e., the generated features in (c) overlap areas of the query set. Thus, training with these generated features can alleviate the mismatch between the distribution estimated only from the few-shot samples and the ground truth distribution.
4.4 Applicability of distribution calibration
Applying distribution calibration on different backbones
Our distribution calibration strategy is agnostic to backbones / feature extractors. Table 5 shows the consistent performance boost when applying distribution calibration on different feature extractors, i.e., four convolutional layers (conv4), six convolutional layers (conv6), resnet18, WRN28 and WRN28 trained with rotation loss. Distribution calibration achieves around 10% accuracy improvement compared to the backbones trained with different baselines.
Backbones  without DC  with DC 

conv4 (Chen et al., 2019a)  ()  
conv6 (Chen et al., 2019a)  ()  
resnet18 (Chen et al., 2019a)  ()  
WRN28 (Mangla et al., 2020)  ()  
WRN28 + Rotation Loss (Mangla et al., 2020)  () 
Applying distribution calibration on other baselines
A variety of works can benefit from training with the features generated by our distribution calibration strategy. We apply our distribution calibration strategy on two simple few-shot classification algorithms, Baseline (Chen et al., 2019a) and Baseline++ (Chen et al., 2019a). Table 6 shows that our distribution calibration brings over 10% of accuracy improvement on both.
4.5 Effects of feature transformation and training with generated features
Ablation Study
Table 4 shows the performance when our model is trained without Tukey's Ladder of Powers transformation for the features as in Equation 3 and when it is trained without the generated features as in Equation 7. There is a severe decline in performance of over 10% in the 5-way-1-shot setting if neither is used. The ablation of either one results in a performance drop of around 5% in the 5-way-1-shot setting.
Choices of Power for Tukey’s Ladder of Powers Transformation
The left side of Figure 3 shows the 5-way-1-shot accuracy when choosing different powers for Tukey's transformation in Equation 3 when training the classifier with the generated features (red) and without (blue). Note that when the power $\lambda$ equals 1, the transformation keeps the original feature representations. The general tendency is consistent for training with and without the generated features, and in both cases we found $\lambda = 0.5$ to be the optimum choice. With Tukey's transformation, the distribution of query set features in target tasks becomes more aligned with the calibrated Gaussian distribution, which benefits the classifier trained on features sampled from the calibrated distribution.
Number of generated features
The right side of Figure 3 analyzes whether more generated features result in consistent improvement in both cases, namely when the features of the support and query sets are transformed by Tukey's transformation (red) and when they are not (blue). We found that when the number of generated features is below 500, both cases can benefit from more generated features. However, when more features are sampled, the performance of the classifier tested on untransformed features begins to decline. By training with the generated samples, the simple logistic regression classifier has a 12% relative performance improvement in a 1-shot classification setting.
4.6 Other Hyperparameters
We select the hyperparameters based on the performance on the validation set. The number $k$ of base class statistics used to calibrate the novel class distribution in Equation 5 is set to 2. Figure 5 shows the effect of different values of $k$. The $\alpha$ in Equation 6 is a constant added to each element of the estimated covariance matrix, which determines the degree of dispersion of features sampled from the calibrated distributions. An appropriate value of $\alpha$ can ensure a good decision boundary for the classifier. Different datasets have different statistics, and an appropriate value of $\alpha$ may vary across datasets. Figure 5 explores the effect of $\alpha$ on all three datasets, i.e., miniImageNet, tieredImageNet and CUB. We observe that in each dataset, the performance on the validation set and the novel (testing) set generally follows the same tendency, which indicates that the variance is dataset-dependent and is not overfitting to a specific set.
5 Conclusion and future works
We propose a simple but effective distribution calibration strategy for few-shot classification. Without complex generative models, training losses or extra parameters to learn, a simple logistic regression trained with features generated by our strategy outperforms the current state-of-the-art methods on miniImageNet. The calibrated distribution is visualized and demonstrates an accurate estimation of the feature distribution. Future works will explore the applicability of distribution calibration to more problem settings, such as multi-domain few-shot classification, and more methods, such as metric-based meta-learning algorithms.
References
 Assume, augment and learn: unsupervised fewshot metalearning via random labels and data augmentation. CoRR. Cited by: §2.
 A closer look at fewshot classification. In ICLR, Cited by: Table 2, §4.1.1, §4.4, Table 5, Table 6.
 Multilevel semantic feature augmentation for oneshot learning. TIP 28 (9), pp. 4594–4605. Cited by: §2, Table 2.
 Modelagnostic metalearning for fast adaptation of deep networks. In ICML, Cited by: §1, §2, Table 2.
 Lowshot learning via covariancepreserving adversarial augmentation networks. In NeurIPS, Cited by: §2.
 Generative adversarial nets. In NeurIPS, Cited by: §2.
 Lowshot visual recognition by shrinking and hallucinating features. In ICCV, Cited by: §1, §2.
 Metasgd: learning to learn quickly for few shot learning. CoRR. External Links: 1707.09835 Cited by: §2, Table 2.
 Negative margin matters: understanding margin in fewshot classification. In ECCV, Cited by: Table 2.
 Deep representation learning on longtailed data: a learnable embedding augmentation perspective. In CVPR, Cited by: §2.
 Prototype propagation networks (PPN) for weaklysupervised fewshot learning on category graph. In IJCAI, Cited by: §2.
 Learning to propagate for graph meta-learning. In NeurIPS, Cited by: §2.
 An ensemble of epoch-wise empirical Bayes for few-shot learning. In ECCV, Cited by: Table 2, Table 3.
 Charting the right manifold: manifold mixup for few-shot learning. In WACV, Cited by: §3.2, §4.1.3, Table 5.
 Meta variance transfer: learning to augment from the others. In ICML, Cited by: §2, Table 2.
 Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12, pp. 2825–2830. Cited by: §4.1.3.
 Diversity helps: unsupervised fewshot learning via distribution shiftbased data augmentation. External Links: 2004.05805 Cited by: §2.
 Optimization as a model for fewshot learning. In ICLR, Cited by: §4.1.1, §4.1.1.
 Metalearning for semisupervised fewshot classification. In ICLR, Cited by: §1, Table 3, §4.1.1.
 Learning Representations by Backpropagating Errors. Nature 323, pp. 533–536. Cited by: §2.
 ImageNet large scale visual recognition challenge. CoRR abs/1409.0575. Cited by: §4.1.1, §4.1.1.
 Metalearning with latent embedding optimization. In ICLR, Cited by: Table 2, Table 3.
 Oneshot learning with a hierarchical nonparametric bayesian model. In ICML workshop, Cited by: §1.
 Deltaencoder: an effective sample synthesis method for fewshot object recognition. In NeurIPS, Cited by: §2, Table 2.
 Prototypical networks for fewshot learning. In NeurIPS, Cited by: §1, §2, Table 2, Table 3.
 Exploratory data analysis. AddisonWesley Series in Behavioral Science, AddisonWesley, Reading, MA. External Links: Link Cited by: §3.2.2.
 Visualizing data using tSNE. Journal of Machine Learning Research. Cited by: §4.3.
 Matching networks for one shot learning. In NeurIPS, Cited by: §2, §3.1, Table 2, Table 3.
 Lowshot learning from imaginary data. In CVPR, Cited by: §1, §2.
 Implicit semantic data augmentation for deep networks. In NeurIPS, Cited by: §2.
 CaltechUCSD Birds 200. Technical report Technical Report CNSTR2010001, California Institute of Technology. Cited by: §4.1.1.
 Feature generating networks for zeroshot learning. In CVPR, Cited by: §1, §2.
 DeepEMD: fewshot image classification with differentiable earth mover’s distance and structured classifiers. In CVPR, Cited by: Table 3.
 Variational fewshot learning. In ICCV, Cited by: §2, Table 2.
 MetaGAN: an adversarial approach to fewshot learning. In NeurIPS, Cited by: §2, Table 2.
Appendix A Augmentation with nearest class features
Instead of sampling from the calibrated distribution, we can simply retrieve examples from the nearest class to augment the support set. Table 7 compares training with samples from the calibrated distribution, with different numbers of features retrieved from the nearest class, and with the support set only. We found that the retrieved features improve the performance compared to using only the support set, but damage the performance as the number of retrieved features increases, probably because the retrieved samples act as noisy data for tasks targeting different classes.
Training data  miniImageNet 5way1shot 

Support set only  
Support set + 1 feature from the nearest class  
Support set + 5 features from the nearest class  
Support set + 10 features from the nearest class  
Support set + 100 features from the nearest class  
Support set + 100 features sampled from calibrated distribution 
Appendix B Distribution Calibration without novel feature
We calibrate the novel class mean by averaging the novel class feature and the retrieved base class means in Equation 6. Table 8 shows the results of distribution calibration without averaging in the novel feature, in which the calibrated mean is calculated as $\mu' = \frac{\sum_{i \in \mathbb{S}_N} \mu_i}{k}$.
miniImageNet 5way1shot  

Distribution Calibration w/o novel feature  
Distribution Calibration w/ novel feature 
Appendix C The effects of Tukey’s transformation
Figure 6 shows the distribution of 5 base classes and 5 novel classes before/after Tukey's transformation. We observe that the base class distribution satisfies the Gaussian assumption well (left), while the novel class distribution is more skewed (middle). The novel class distribution after Tukey's transformation (right) is more aligned with the Gaussian-like base class distribution.
Appendix D The similarity level analysis
We found that the higher the similarity between the retrieved base class distribution and the novel class ground-truth distribution, the greater the performance improvement our method brings, as shown in Table 9. The results in the table are for the 5-way-1-shot setting.
Novel class  Top1 base class similarity  Top2 base class similarity  DC improvement 

malamute  93%  85%  21.30% 
golden retriever  85%  74%  18.37% 
ant  71%  67%  9.77% 