Self-supervised Knowledge Distillation for Few-shot Learning

06/17/2020 ∙ by Jathushan Rajasegaran, et al. ∙ University of Central Florida

The real world contains an overwhelmingly large number of object classes, and learning all of them at once is infeasible. Few-shot learning is a promising learning paradigm due to its ability to learn out-of-order distributions quickly with only a few samples. Recent works [7, 41] show that simply learning a good feature embedding can outperform more sophisticated meta-learning and metric-learning algorithms for few-shot learning. In this paper, we propose a simple approach to improve the representation capacity of deep neural networks for few-shot learning tasks. We follow a two-stage learning process: first, we train a neural network to maximize the entropy of the feature embedding, thus creating an optimal output manifold using a self-supervised auxiliary loss. In the second stage, we minimize the entropy on the feature embedding by bringing self-supervised twins together, while constraining the manifold with student-teacher distillation. Our experiments show that, even in the first stage, self-supervision can outperform current state-of-the-art methods, with further gains achieved by our second-stage distillation process. Our code is publicly available.




1 Introduction

Modern deep learning algorithms generally require large amounts of annotated data, which is often laborious and expensive to acquire bengio2017deep; khan2018guide. Inspired by the fact that humans can learn from only a few examples, few-shot learning (FSL) offers a promising machine learning paradigm. FSL aims to develop models that can generalize to new concepts using only a few annotated samples (typically ranging from 1-5). Due to data scarcity and limited supervision, FSL remains a challenging problem.

Existing works mainly approach FSL using meta-learning finn2017model; li2017meta; jamal2019task; rusu2018metalearning; bertinetto2018meta; lee2019meta; ravichandran2019few to adapt the base learner for the new tasks, or by enforcing margin maximizing constraints through metric learning koch2015siamese; sung2018learning; vinyals2016matching; snell2017prototypical. In doing so, these FSL methods ignore the importance of intra-class diversity while seeking to achieve inter-class discriminability. In this work, instead of learning representations which are invariant to within class changes, we argue for an equivariant representation. Our main intuition is that major transformations in the input domain are desired to be reflected in their corresponding outputs to ensure output space diversity. By faithfully reflecting these changes in an equivariant manner, we seek to learn the true natural manifold of an object class.

We propose a two-stage self-supervised knowledge distillation (SKD) approach for FSL. Despite the availability of only few-shot labeled examples, we show that auxiliary self-supervised learning (SSL) signals can be mined from the limited data and effectively leveraged to learn the true output-space manifold of each class. For this purpose, we take a direction in contrast to previous works, which learn an invariant representation that maps augmented inputs to the same prediction. With the goal of enhancing the generalizability of the model, we first learn a Generation zero (Gen-0) model whose output predictions are equivariant to the input transformations, thereby avoiding overfitting and ensuring heterogeneity in the prediction space. For example, when learning to classify objects in the first stage of learning, the self-supervision based learning objective ensures that the output logits are rich enough to encode the amount of rotation applied to the input images.

Once the generation zero network is trained to estimate the optimal output manifold, we perform knowledge distillation by treating the learned model as a teacher network and training a student model with the teacher's outputs. Different from the first stage, we now enforce that the augmented samples and original inputs result in similar predictions to enhance between-class discrimination. The knowledge distillation mechanism therefore guides the Generation one (Gen-1) model to develop two intuitive properties. First, the output class manifold is diverse enough to preserve major transformations in the input, thereby avoiding overfitting and improving generalization. Second, the learned relationships in the output space encode natural connections between classes, e.g., two similar classes should have correlated predictions, as opposed to the totally independent categories assumed in one-hot encoded ground-truths. Thus, by faithfully representing the output space via encoding inter-class relationships and preserving intra-class diversity, our approach learns improved representations for FSL.

The following are the main contributions of this work (see Fig. 1 for an overview):


  • Different to existing works that use SSL as an auxiliary task, we show the benefit of SSL towards enforcing diversity constraints in the prediction space with a simple architectural modification.

  • A dual-stage training regime which first estimates the optimal output manifold, and then minimizes the original-augmented pair distance while anchoring the original samples to the learned manifold using a distillation loss.

  • Extensive evaluations on four popular benchmark datasets with significant improvements on the FSL task.

Figure 1: Self-supervised Knowledge Distillation operates in two phases. In Gen-0, self-supervision is used to estimate the true prediction manifold, equivariant to input transformations. Specifically, we enforce the model to predict the amount of input rotation using only the output logits. In Gen-1, we force the original sample outputs to be the same as in Gen-0 (dotted lines), while reducing its distance with its augmented versions to enhance discriminability.

2 Related work

Self-supervised learning (SSL): This form of learning defines auxiliary learning tasks that can enhance a model's learning capability without requiring any additional annotation effort. Generally, these surrogate tasks require a higher-level understanding, thereby forcing the learning agent to learn useful representations while solving the auxiliary tasks. The main difference among existing SSL techniques lies in the way the supervisory signal is obtained from the data. For example, gidaris2018unsupervised learns useful representations by predicting the amount of rotation applied to an input image. Doersch et al. doersch2015unsupervised train a CNN to predict the relative position of a pair of randomly sampled image patches. This idea is further extended to predict permutations of multiple image patches in noroozi2016unsupervised. Alternatively, image colorization and object counting were employed as pretext tasks to improve representation learning zhang2016colorful; noroozi2017representation. Zhai et al. zhai2019s4l propose an SSL approach in a semi-supervised setting where some labelled and many unlabelled examples are available. Different from these works, our approach uses self-supervision to enforce additional constraints in the classification space. Close to our work is a set of approaches that seek to learn representations that are invariant to image transformations and augmentations dosovitskiy2014discriminative; caron2018deep; chen2020simple. In contrast, our approach does the exact opposite: we seek to learn an equivariant representation, so that the true natural manifold of an object class can be learned with only a few examples.

Few-shot learning (FSL): There have been several efforts on FSL, ranging from metric learning to meta-learning methods. Metric learning methods commonly learn a metric space in which the support set can be easily matched with the query set. For example, Koch et al. koch2015siamese use a Siamese network to learn a similarity metric to classify unknown classes with the aid of a support set. Sung et al. sung2018learning use a relation module to learn the relationships between the support set and the query image. Matching networks vinyals2016matching employ attention and memory to learn a network that matches the support set to the query image. In addition, snell2017prototypical assigns the mean embedding as a class prototype and minimizes the distance between it and the samples in the query set. In contrast, we only use augmented pairs of an image to move their embeddings closer, while preserving their respective distances in the output space.

Another category of methods employs meta-learning to leverage the knowledge acquired from past tasks to learn new tasks. Finn et al. finn2017model proposed the popular model-agnostic meta-learning (MAML) framework, which finds better initialization weights that can be quickly adapted to a given support set. Building on finn2017model, li2017meta; Flennerhag2020 use meta-learned preconditioning to redirect the gradient flow to achieve better convergence. In addition to these works, LEO (Latent Embedding Optimization) rusu2018metalearning transforms network weights to a lower-dimensional latent embedding space and applies the MAML algorithm to scale to larger networks. MetaOptNet lee2019meta employs an SVM to model meta-learning as a convex optimization problem which is solved using quadratic programming.

Some recent works attribute the success of meta-learning to its strong feature representation capability rather than meta-learning itself raghu2019rapid. Others Dhillon2020A; tian2020rethink show the effectiveness of a simple baseline by learning a strong embedding. This work is an effort along the same direction, and proposes a novel self-supervised knowledge distillation approach which can learn effective feature representations for FSL. The closest to our work is Gidaris et al. gidaris2019boosting, who use self-supervision to boost few-shot classification. However, gidaris2019boosting simply uses self-supervision as an auxiliary loss during a single training stage, while we use it to shape and constrain the learning manifold. Architecture-wise, we use a sequential self-supervision layer, while gidaris2019boosting has a parallel design. While gidaris2019boosting does not have multiple generations, we further improve the representations in the second generation by constraining the embedding space using distillation and bringing embeddings of rotated pairs closer to their original embeddings.

3 Our Approach

The proposed SKD uses a two-stage training pipeline: Generation zero (Gen-0) and Generation one (Gen-1). Gen-0 utilizes self-supervision to learn a wider classification manifold, in which the learned embeddings are equivariant to rotation (or another data transformation). Later, during Gen-1, we use the Gen-0 model as a teacher and use original (non-rotated) images as anchors to preserve the learned manifold, while rotated versions of the images are used to reduce intra-class distances in the embedding space, learning robust and discriminative feature representations.

3.1 Setting

Let us assume a neural network f contains feature embedding parameters θ and classification weights φ. Any input image x can be mapped to a feature vector v ∈ R^d by a function f_θ : x → v. Consequently, features are mapped to logits p ∈ R^c by another function f_φ : v → p, where c denotes the number of output classes. Hence, f is conventionally defined as a composition of these functions, f = f_φ ∘ f_θ. In this work, we introduce another function f_ψ, parameterized by ψ, such that f_ψ : p → q, which maps the logits p to a secondary set of logits q for a self-supervised task (e.g., rotation classification). For each input x, we automatically obtain labels for the self-supervision task. Therefore, the complete network can be represented as f = f_ψ ∘ f_φ ∘ f_θ.
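This composition, with the self-supervision head f_ψ attached sequentially after the classification layer, can be sketched in PyTorch. The layer sizes (a 640-d embedding, 64 base classes) and the toy embedding network are illustrative placeholders, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class SKDNet(nn.Module):
    """f = f_psi ∘ f_phi ∘ f_theta: the rotation head f_psi is attached
    sequentially AFTER the classification layer, not to the features."""
    def __init__(self, feat_dim=640, num_classes=64, num_rotations=4):
        super().__init__()
        # Placeholder embedding network f_theta (the paper uses ResNet-12).
        self.f_theta = nn.Sequential(
            nn.Flatten(), nn.LazyLinear(feat_dim), nn.ReLU())
        self.f_phi = nn.Linear(feat_dim, num_classes)       # class logits p
        self.f_psi = nn.Linear(num_classes, num_rotations)  # rotation logits q

    def forward(self, x):
        v = self.f_theta(x)   # features
        p = self.f_phi(v)     # classification logits
        q = self.f_psi(p)     # self-supervision logits, computed from p (not v)
        return v, p, q
```

Calling `SKDNet()(x)` on a batch of images returns the triplet (v, p, q) used by both training stages.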

We consider a dataset D with image-label pairs {x_i, y_i}, where y_i ∈ {1, …, c}. During evaluation, we sample episodes as in the classical few-shot learning literature. An episode E contains a support set S and a query set Q. In an N-way K-shot setting, S has K samples for each of the N classes.
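As an illustration of this episodic protocol, a minimal N-way K-shot sampler over a generic list of (image, label) pairs might look as follows (function and variable names are ours, not the paper's):

```python
import random
from collections import defaultdict

def sample_episode(dataset, n_way=5, k_shot=1, n_query=15):
    """Sample an N-way K-shot episode: a support set with k_shot examples
    for each of n_way classes, plus n_query query examples per class.
    `dataset` is a list of (image, label) pairs."""
    by_class = defaultdict(list)
    for img, lbl in dataset:
        by_class[lbl].append(img)
    classes = random.sample(list(by_class), n_way)
    support, query = [], []
    # Relabel the sampled classes 0..n_way-1 for the episode.
    for new_lbl, cls in enumerate(classes):
        imgs = random.sample(by_class[cls], k_shot + n_query)
        support += [(im, new_lbl) for im in imgs[:k_shot]]
        query += [(im, new_lbl) for im in imgs[k_shot:]]
    return support, query
```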

3.2 Generation Zero

During the first stage (aka Gen-0), a minibatch B = {x, y} is randomly sampled from the dataset D, with b image-label pairs. We first take the images x and apply a transformation function t(·) to create augmented copies of x. For the sake of brevity, here we consider t as a rotation transformation; however, any other suitable transformation can also be considered, as we show in our experiments (Sec. 4.2). Applying rotations of 90, 180 and 270 degrees to x, we create x_90, x_180 and x_270, respectively. Then we combine all augmented versions of the images into a single tensor x̂ = {x, x_90, x_180, x_270}, whose corresponding class labels are ŷ. Additionally, one-hot encoded labels r̂ for the rotation direction are also created, where r̂_i ∈ R^4 due to the four rotation directions in our self-supervised task.

First, we pass x̂ through f_θ, resulting in the features v̂. Then, the features are passed through f_φ to get the corresponding logits p̂, and finally, the logits are passed through f_ψ to get the rotation logits q̂:

  v̂ = f_θ(x̂),   p̂ = f_φ(v̂),   q̂ = f_ψ(p̂).
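A sketch of this rotation-augmented batch construction, together with the Gen-0 loss it feeds (the default α = 2.0 is illustrative, and we treat the self-supervision loss as a binary cross-entropy over one-hot rotation labels as described in the text):

```python
import torch
import torch.nn.functional as F

def gen0_batch(x, y):
    """Stack the original images with their 90/180/270-degree rotations.
    x: (b, C, H, W) images, y: (b,) integer class labels."""
    xs = [x] + [torch.rot90(x, k, dims=(2, 3)) for k in (1, 2, 3)]
    x_hat = torch.cat(xs, dim=0)                          # (4b, C, H, W)
    y_hat = y.repeat(4)                                   # class labels for all copies
    r_hat = torch.arange(4).repeat_interleave(x.size(0))  # rotation labels 0..3
    return x_hat, y_hat, r_hat

def gen0_loss(p_hat, q_hat, y_hat, r_hat, alpha=2.0):
    """L_gen0 = L_ce(p, y) + alpha * L_ss(q, r), with L_ss a binary
    cross-entropy over one-hot rotation labels (alpha value is illustrative)."""
    l_ce = F.cross_entropy(p_hat, y_hat)
    r_onehot = F.one_hot(r_hat, num_classes=4).float()
    l_ss = F.binary_cross_entropy_with_logits(q_hat, r_onehot)
    return l_ce + alpha * l_ss
```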

We employ two loss functions to optimize the model in Gen-0: (a) a categorical cross-entropy loss L_ce between the predicted logits p̂ and the true labels ŷ, and (b) a self-supervision loss L_ss between the rotation logits q̂ and the rotation labels r̂. Note that the self-supervision loss is simply a binary cross-entropy loss. These two loss terms are combined with a weighting coefficient α to get our final loss,

  L_gen0 = L_ce(p̂, ŷ) + α · L_ss(q̂, r̂).

The whole process of training the Gen-0 model can be stated as the following optimization problem,

  min_{θ, φ, ψ} L_gen0.

The above objective makes sure that the output logits are representative enough to encapsulate information about the input transformation, thereby successfully predicting the amount of rotation applied to the input. This behaviour allows us to maintain diversity in the output space and faithfully estimate the natural data manifold of each object category.

Figure 2: Overall training process of SKD: Generation Zero uses multiple rotated versions of the images to train the neural network to predict the class as well as the rotated angle. Then during Generation One, we use original version of the images as anchor points to preserve the manifold while moving the logits for the rotated version closer, to increase the discriminative ability of the network.

3.3 Generation One

Once the Gen-0 model is trained with the cross-entropy and self-supervision loss functions, we take two clones of the trained model: a teacher model f_t and a student model f_s. Weights of the teacher model are frozen and used only for inference. Again, we sample a minibatch B from D and generate a twin x̃ from x. In this case, a twin x̃ is simply a rotated version of x (e.g., x_180). During Gen-1 training, x is used as an anchor point to constrain any changes to the classification manifold. This is enforced by a knowledge distillation hinton2015distilling loss between the teacher and student networks. Concurrently, an auxiliary loss is employed to bring the embeddings of x and x̃ together to enhance feature discriminability while preserving the original output manifold.

Specifically, we first pass x through the teacher network f_t to obtain its logits p_t. Then, x and x̃ are passed through the student network f_s to get their corresponding logits p_s and p̃_s, respectively.

We use a Kullback–Leibler (KL) divergence measure between p_s and p_t for knowledge distillation, and apply an L2 loss between p_s and p̃_s to achieve better discriminability,

  L_kd = KL( σ(p_s / τ), σ(p_t / τ) ),   L_2 = ||p_s − p̃_s||²,

where σ is a softmax function and τ is a temperature parameter used to soften the output distribution. Finally, we combine these two loss terms by a coefficient β as follows,

  L_gen1 = L_kd + β · L_2.

The overall Gen-1 training process can be stated as the following optimization problem over the student parameters,

  min_{θ_s, φ_s, ψ_s} L_gen1.
Note that, in our setting, it is necessary to have the rotation classification head added sequentially after the classification layer, unlike previous works gidaris2019boosting; chen2019self; sun2019unsupervised which connect the rotation classification head directly after the feature embedding layer. This is because, during Gen-0, we encourage the penultimate layer to encode information about both the image class and its rotation (thus preserving output space diversity), and later in Gen-1, we bring the logits of the rotated pairs closer (to improve discrimination). These benefits are not possible if the rotation head is connected directly to the feature embedding layer or if distillation is performed on the feature embeddings.
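The Gen-1 objective can be sketched as follows. The default β and τ are illustrative values rather than the paper's tuned settings, and the τ² scaling on the KL term follows the standard Hinton-style distillation convention:

```python
import torch
import torch.nn.functional as F

def gen1_loss(p_s, p_s_twin, p_t, beta=0.1, tau=4.0):
    """L_gen1 = KL(softmax(p_s/tau) || softmax(p_t/tau)) + beta * ||p_s - p_s_twin||^2.
    p_s: student logits for the original images,
    p_s_twin: student logits for their rotated twins,
    p_t: frozen-teacher logits for the original images."""
    l_kd = F.kl_div(
        F.log_softmax(p_s / tau, dim=1),
        F.softmax(p_t / tau, dim=1),
        reduction="batchmean") * (tau ** 2)   # standard temperature scaling
    l_2 = F.mse_loss(p_s, p_s_twin)
    return l_kd + beta * l_2
```

When the student matches the teacher exactly and the twin logits coincide with the originals, both terms vanish, so the loss anchors the student to the Gen-0 manifold while pulling rotated pairs together.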

1:  Require: dataset D, model f = f_ψ ∘ f_φ ∘ f_θ, coefficients α, β, learning rate η
2:  for e iterations do                    ▷ Generation Zero training
3:      while B = {x, y} ~ D do
4:          x_90, x_180, x_270 ← rotate(x)
5:          x̂ ← {x, x_90, x_180, x_270},  ŷ ← {y, y, y, y}
6:          r̂ ← one-hot rotation labels for the four rotated copies
7:          v̂ ← f_θ(x̂),  p̂ ← f_φ(v̂),  q̂ ← f_ψ(p̂)
8:          L ← L_ce(p̂, ŷ) + α · L_ss(q̂, r̂)
9:          θ, φ, ψ ← θ, φ, ψ − η · ∇L
10: f_t, f_s ← clone(f)                    ▷ teacher weights frozen
11: for e iterations do                    ▷ Generation One training
12:     while B = {x, y} ~ D do
13:         x̃ ← rotate(x)
14:         p_t ← f_t(x),  (p_s, p̃_s) ← f_s({x, x̃})
15:         L ← L_kd(p_s, p_t) + β · L_2(p_s, p̃_s)
16:         θ_s, φ_s, ψ_s ← θ_s, φ_s, ψ_s − η · ∇L
17: return f_s
Algorithm 1 Training procedure of SKD

3.4 Evaluation

During evaluation, a held-out part of the dataset is used to sample tasks. Each task comprises a support set S and a query set Q. S has image-label pairs {x_i, y_i}, while Q comprises an image tensor x_q. Both S and Q are fed to the final trained f_θ model to get the feature embeddings v_s and v_q, respectively. We use a simple logistic regression classifier tian2020rethink; bertinetto2018meta to map the labels from the support set to the query set. The embeddings are normalized onto a unit sphere tian2020rethink. We randomly sample 600 tasks, and report the mean classification accuracy with a 95% confidence interval. Unlike popular meta-learning algorithms (e.g., finn2017model; li2017meta), we do not need to train multiple models for different values of N and K in N-way, K-shot classification. Since classification is disentangled from feature learning in our case, the same model can be used to evaluate for any value of N and K in FSL.
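This per-task evaluation step can be sketched with scikit-learn (a simplified sketch; the function name and setup are ours):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def evaluate_task(v_support, y_support, v_query):
    """Fit a linear classifier on L2-normalized support embeddings and
    predict query labels, as in the evaluation protocol above."""
    v_s = v_support / np.linalg.norm(v_support, axis=1, keepdims=True)
    v_q = v_query / np.linalg.norm(v_query, axis=1, keepdims=True)
    clf = LogisticRegression(max_iter=1000)
    clf.fit(v_s, y_support)
    return clf.predict(v_q)
```

Because only this lightweight classifier depends on N and K, the same frozen embedding network serves any few-shot configuration.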

4 Experiments and Results

We comprehensively compare our method on four benchmark few-shot learning datasets i.e., miniImageNet vinyals2016matching, tieredImageNet ren2018metalearning, CIFAR-FS bertinetto2018meta and FC100 oreshkin2018tadam. Additionally, we provide an extensive ablation study to investigate the individual contributions of different components (Sec. 4.2).

Implementation Details: To be consistent with previous methods tian2020rethink; mishra2018a; oreshkin2018tadam; lee2019meta, we use ResNet-12 as the backbone in our experiments. The backbone architecture contains 4 residual blocks of 64, 160, 320, 640 filters as in tian2020rethink; ravichandran2019few; lee2019meta, each with 3×3 convolutions. A max pooling operation is applied after the first 3 blocks, and global average pooling is applied after the last block. Additionally, a 4-neuron fully-connected layer (one output per rotation direction) is added after the final classification layer.

We use SGD with momentum and weight decay for optimization; the learning rate is reduced after epoch 60. Gen-0 and Gen-1 models on CIFAR-FS are trained for 65 epochs, while the rest of the models are trained for 8 epochs only. Consistent with previous approaches finn2017model; tian2020rethink; rusu2018metalearning, random crop, color jittering and random horizontal flip are applied for data augmentation during training. Further, the hyper-parameters α and β are tuned on a validation set, and we use the same value as in tian2020rethink for the temperature coefficient τ during distillation.


We evaluate our method on four widely used FSL benchmarks: two subsets of ImageNet, i.e., miniImageNet vinyals2016matching and tieredImageNet ren2018metalearning, and two splits of CIFAR100, i.e., CIFAR-FS bertinetto2018meta and FC100 oreshkin2018tadam. For miniImageNet vinyals2016matching, we use the split proposed in ravi2016optimization, with 64, 16 and 20 classes for training, validation and testing. tieredImageNet ren2018metalearning contains 608 classes which are semantically grouped into 34 high-level classes, further divided into 20, 6 and 8 for the training, validation, and test splits, making the splits more diverse. CIFAR-FS bertinetto2018meta contains a random split of 100 classes into 64, 16 and 20 for training, validation, and testing, while FC100 oreshkin2018tadam uses a split similar in spirit to tieredImageNet, again making the splits more diverse; FC100 has 60, 20 and 20 classes for training, validation, and testing, respectively.

                                                          | miniImageNet, 5-way         | tieredImageNet, 5-way
Method                                     | Backbone     | 1-shot       | 5-shot       | 1-shot       | 5-shot
MAML finn2017model                         | 32-32-32-32  | 48.70 ± 1.84 | 63.11 ± 0.92 | 51.67 ± 1.81 | 70.30 ± 1.75
Matching Networks vinyals2016matching      | 64-64-64-64  | 43.56 ± 0.84 | 55.31 ± 0.73 | -            | -
IMP pmlr-v97-allen19b                      | 64-64-64-64  | 49.2 ± 0.7   | 64.7 ± 0.7   | -            | -
Prototypical Networks snell2017prototypical| 64-64-64-64  | 49.42 ± 0.78 | 68.20 ± 0.66 | 53.31 ± 0.89 | 72.69 ± 0.74
TAML jamal2019task                         | 64-64-64-64  | 51.77 ± 1.86 | 66.05 ± 0.85 | -            | -
SAML hao2019collect                        | 64-64-64-64  | 52.22 n/a    | 66.49 n/a    | -            | -
GCR li2019few                              | 64-64-64-64  | 53.21 ± 0.80 | 72.34 ± 0.64 | -            | -
KTN(Visual) peng2019few                    | 64-64-64-64  | 54.61 ± 0.80 | 71.21 ± 0.66 | -            | -
PARN wu2019parn                            | 64-64-64-64  | 55.22 ± 0.84 | 71.55 ± 0.66 | -            | -
Dynamic Few-shot gidaris2018dynamic        | 64-64-128-128| 56.20 ± 0.86 | 73.00 ± 0.64 | -            | -
Relation Networks sung2018learning         | 64-96-128-256| 50.44 ± 0.82 | 65.32 ± 0.70 | 54.48 ± 0.93 | 71.32 ± 0.78
R2D2 bertinetto2018meta                    | 96-192-384-512| 51.2 ± 0.6  | 68.8 ± 0.1   | -            | -
SNAIL mishra2018a                          | ResNet-12    | 55.71 ± 0.99 | 68.88 ± 0.92 | -            | -
AdaResNet pmlr-v80-munkhdalai18a           | ResNet-12    | 56.88 ± 0.62 | 71.94 ± 0.57 | -            | -
TADAM oreshkin2018tadam                    | ResNet-12    | 58.50 ± 0.30 | 76.70 ± 0.30 | -            | -
Shot-Free ravichandran2019few              | ResNet-12    | 59.04 n/a    | 77.64 n/a    | 63.52 n/a    | 82.59 n/a
TEWAM qiao2019transductive                 | ResNet-12    | 60.07 n/a    | 75.90 n/a    | -            | -
MTL sun2019meta                            | ResNet-12    | 61.20 ± 1.80 | 75.50 ± 0.80 | -            | -
Variational FSL schonfeld2019generalized   | ResNet-12    | 61.23 ± 0.26 | 77.69 ± 0.17 | -            | -
MetaOptNet lee2019meta                     | ResNet-12    | 62.64 ± 0.61 | 78.63 ± 0.46 | 65.99 ± 0.72 | 81.56 ± 0.53
Diversity w/ Cooperation dvornik2019diversity| ResNet-18  | 59.48 ± 0.65 | 75.62 ± 0.48 | -            | -
Boosting gidaris2019boosting               | WRN-28-10    | 63.77 ± 0.45 | 80.70 ± 0.33 | 70.53 ± 0.51 | 84.98 ± 0.36
Fine-tuning Dhillon2020A                   | WRN-28-10    | 57.73 ± 0.62 | 78.17 ± 0.49 | 66.58 ± 0.70 | 85.55 ± 0.48
LEO-trainval rusu2018metalearning †        | WRN-28-10    | 61.76 ± 0.08 | 77.59 ± 0.12 | 66.33 ± 0.05 | 81.44 ± 0.09
RFS-simple tian2020rethink                 | ResNet-12    | 62.02 ± 0.63 | 79.64 ± 0.44 | 69.74 ± 0.72 | 84.41 ± 0.55
RFS-distill tian2020rethink                | ResNet-12    | 64.82 ± 0.60 | 82.14 ± 0.43 | 71.52 ± 0.69 | 86.03 ± 0.49
SKD-GEN0                                   | ResNet-12    | 65.93 ± 0.81 | 83.15 ± 0.54 | 71.69 ± 0.91 | 86.66 ± 0.60
SKD-GEN1                                   | ResNet-12    | 67.04 ± 0.85 | 83.54 ± 0.54 | 72.03 ± 0.91 | 86.50 ± 0.58
Table 1: FSL results on miniImageNet vinyals2016matching and tieredImageNet ren2018metalearning datasets, with mean accuracy and 95% confidence interval. † results obtained by training on train+val sets. Table is an extended version from tian2020rethink.

4.1 Few-shot learning results

Our results, shown in Tables 1 and 2, suggest that the proposed method consistently outperforms current methods on all four datasets. Even our Gen-0 alone performs better than the current state-of-the-art (SOTA) methods by a considerable margin. For example, the SKD Gen-0 model surpasses SOTA performance on miniImageNet by about 1% on both 5-way 1-shot and 5-way 5-shot tasks. The same can be observed on other datasets. Compared to RFS-simple tian2020rethink (similar to our Gen-0), SKD shows an improvement of 3.91% on 5-way 1-shot and 3.51% on 5-way 5-shot learning. The same trend can be observed across the other evaluated datasets, with consistent 2-4% gains over RFS-simple. This is due to the novel self-supervision which enables SKD to learn a diverse and generalizable embedding space.

Gen-1 incorporates knowledge distillation and proves even more effective compared with Gen-0. On miniImageNet, we achieve 67.04% and 83.54% on 5-way 1-shot and 5-way 5-shot learning tasks, respectively. These are gains of 2.22% and 1.40% over previous SOTA on the 5-way 1-shot and 5-way 5-shot tasks. Similar consistent gains of up to 3% over SOTA results can be observed across the other evaluated datasets. Note that RFS-distill tian2020rethink uses multiple iterations (up to 3-4 generations) for model distillation, while SKD only uses a single generation. We attribute our gain to the way we use knowledge distillation to constrain changes in the embedding space, while minimizing the embedding distance between images and their rotated pairs, thus enhancing the representation capabilities of the model.

                                           |              | CIFAR-FS, 5-way         | FC100, 5-way
Method                                     | Backbone     | 1-shot     | 5-shot     | 1-shot     | 5-shot
MAML finn2017model                         | 32-32-32-32  | 58.9 ± 1.9 | 71.5 ± 1.0 | -          | -
Prototypical Networks snell2017prototypical| 64-64-64-64  | 55.5 ± 0.7 | 72.0 ± 0.6 | 35.3 ± 0.6 | 48.6 ± 0.6
Relation Networks sung2018learning         | 64-96-128-256| 55.0 ± 1.0 | 69.3 ± 0.8 | -          | -
R2D2 bertinetto2018meta                    | 96-192-384-512| 65.3 ± 0.2| 79.4 ± 0.1 | -          | -
TADAM oreshkin2018tadam                    | ResNet-12    | -          | -          | 40.1 ± 0.4 | 56.1 ± 0.4
Shot-Free ravichandran2019few              | ResNet-12    | 69.2 n/a   | 84.7 n/a   | -          | -
TEWAM qiao2019transductive                 | ResNet-12    | 70.4 n/a   | 81.3 n/a   | -          | -
Prototypical Networks snell2017prototypical| ResNet-12    | 72.2 ± 0.7 | 83.5 ± 0.5 | 37.5 ± 0.6 | 52.5 ± 0.6
Boosting gidaris2019boosting               | WRN-28-10    | 73.6 ± 0.3 | 86.0 ± 0.2 | -          | -
MetaOptNet lee2019meta                     | ResNet-12    | 72.6 ± 0.7 | 84.3 ± 0.5 | 41.1 ± 0.6 | 55.5 ± 0.6
RFS-simple tian2020rethink                 | ResNet-12    | 71.5 ± 0.8 | 86.0 ± 0.5 | 42.6 ± 0.7 | 59.1 ± 0.6
RFS-distill tian2020rethink                | ResNet-12    | 73.9 ± 0.8 | 86.9 ± 0.5 | 44.6 ± 0.7 | 60.9 ± 0.6
SKD-GEN0                                   | ResNet-12    | 74.5 ± 0.9 | 88.0 ± 0.6 | 45.3 ± 0.8 | 62.2 ± 0.7
SKD-GEN1                                   | ResNet-12    | 76.9 ± 0.9 | 88.9 ± 0.6 | 46.5 ± 0.8 | 63.1 ± 0.7
Table 2: FSL on CIFAR-FS bertinetto2018meta and FC100 oreshkin2018tadam datasets, with mean accuracy and 95% confidence interval. † results obtained by training on train+val sets. Table is an extended version from tian2020rethink.

4.2 Ablation Studies and Analysis

Choices of loss function: We study the impact of different contributions by progressively integrating them into our proposed method. To this end, we first evaluate our method with and without the self-supervision loss. If we train Gen-0 with only the cross-entropy loss, which is the same as RFS-simple tian2020rethink, the model achieves 71.5% and 62.02% on the 5-way 1-shot task on CIFAR-FS and miniImageNet, respectively. Then, if we train Gen-0 with the additional self-supervision, the model performance improves to 74.5% and 65.93%. This shows an absolute gain of 3.0% and 3.91% from incorporating our proposed self-supervision. Additionally, if we only keep knowledge distillation for Gen-1, we can see that self-supervision for Gen-0 has a clear impact on the next generation. As shown in Table 3, self-supervision at Gen-0 is responsible for a clear performance improvement at Gen-1. Further, during Gen-1, the advantage of using the L2 loss to bring logits of rotated augmentations closer is also demonstrated in Table 3. We can see that, for Gen-0 models trained both with and without self-supervision, the addition of the L2 loss during Gen-1 gives about a 1% gain compared with using knowledge distillation only. These empirical evaluations clearly establish the individual importance of the different contributions (self-supervision, knowledge distillation, and ensuring proximity of augmented versions of the image in the output space) in our proposed two-stage approach.

Choices of self-supervision: We further investigate different choices of self-supervision. Instead of rotation-based self-supervision, we use a crop of an image and train the final classifier to predict the correct crop quadrant sun2019unsupervised. The results in Table 4 show that this crop-based self-supervision can also surpass the SOTA FSL methods, though it performs slightly below rotation-based self-supervision. We further experiment with the simCLR loss chen2020simple, which also aims to bring augmented pairs closer together, alongside knowledge distillation during Gen-1. Our experiments show that simCLR performs noticeably worse on both the 1-shot and 5-shot tasks.

Variations of α: During Gen-0 of SKD, α controls the contribution of self-supervision relative to classification. Fig. 3 (left) shows the Gen-0 performance as α changes. We observe that the performance increases as α grows from 0 to 2, and then decreases. The results indicate that the performance is not overly sensitive to the value of α. It is important to note that Gen-0 without self-supervision, i.e., α = 0, performs the lowest compared with all other values of α, thus establishing the importance of self-supervision.

Variations of β: At Gen-1, we again use a coefficient β to control the contribution of the L2 loss relative to knowledge distillation. From the results in Fig. 3 (right), we observe a trend similar to the case of α: the performance first improves as β increases, and then decreases for larger values of β. However, even across a wide range of β, the performance drops only marginally. Note that the performance on CIFAR-FS 5-way 1-shot without the L2 loss, i.e., β = 0, is the lowest compared with other values of β, showing the importance of the L2 loss.

Time Complexity: Let T be the time required for training one generation. RFS tian2020rethink has a time complexity of O(nT), where n is the number of generations, which is usually about 3-4. In contrast, our complexity is O(2T), since SKD trains only two generations. Note that the additional rotation augmentations do not substantially affect the training time, thanks to parallel computing on GPUs, and we generally train the Gen-1 model for fewer epochs than Gen-0. Using a single Tesla V100 GPU on CIFAR-FS, both RFS and SKD take approximately the same time for the first generation; the complete training of SKD is therefore considerably faster than RFS, since SKD trains one extra generation instead of 3-4.

           |                              | CIFAR-FS, 5-way         | miniImageNet, 5-way
Generation | Loss Function                | 1-shot     | 5-shot     | 1-shot       | 5-shot
GEN-0      | L_ce                         | 71.5 ± 0.8 | 86.0 ± 0.5 | 62.02 ± 0.63 | 79.64 ± 0.44
GEN-0      | L_ce + α·L_ss                | 74.5 ± 0.9 | 88.0 ± 0.6 | 65.93 ± 0.81 | 83.15 ± 0.54
GEN-1      | L_ce → L_kd                  | 73.9 ± 0.8 | 86.9 ± 0.5 | 64.82 ± 0.60 | 82.14 ± 0.43
GEN-1      | L_ce → L_kd + β·L_2          | 74.9 ± 1.0 | 87.6 ± 0.6 | 64.76 ± 0.84 | 81.84 ± 0.54
GEN-1      | L_ce + α·L_ss → L_kd         | 75.6 ± 0.9 | 88.7 ± 0.6 | 66.48 ± 0.84 | 83.64 ± 0.53
GEN-1      | L_ce + α·L_ss → L_kd + β·L_2 | 76.9 ± 0.9 | 88.9 ± 0.6 | 67.04 ± 0.85 | 83.54 ± 0.54
Table 3: FSL results on CIFAR-FS bertinetto2018meta and miniImageNet vinyals2016matching, with different combinations of loss functions for Gen-0 and Gen-1. For Gen-1, the loss functions on the left side of the arrow were used to train the Gen-0 model.
                      | Generation 0, 5-way     | Generation 1, 5-way
Self-supervision Type | 1-shot     | 5-shot     | 1-shot     | 5-shot
None                  | 71.5 ± 0.8 | 86.0 ± 0.5 | 73.9 ± 0.8 | 86.9 ± 0.5
Rotation              | 74.5 ± 0.9 | 88.0 ± 0.6 | 76.9 ± 0.9 | 88.9 ± 0.6
Location              | 74.1 ± 0.9 | 88.0 ± 0.6 | 76.2 ± 0.9 | 87.8 ± 0.6
Table 4: Few-shot learning results on CIFAR-FS bertinetto2018meta, comparing no self-supervision against predicting the rotation and predicting the location of a patch as the self-supervision method.
Figure 3: Ablation study on the sensitivity of the loss coefficient hyper-parameters α and β.

5 Conclusion

Deep learning models can easily overfit on the scarce data available in FSL settings. To enhance generalizability, existing approaches regularize the model to preserve margins or encode high-level learning behaviour via meta-learning. In this work, we take a different approach and propose to learn the true output classification manifold via self-supervised learning. Our approach operates in two phases: first, the model learns to classify inputs such that the diversity in the outputs is not lost, thereby avoiding overfitting and modeling the natural output manifold structure. Once this structure is learned, our approach trains a student model that preserves the original output manifold structure while jointly maximizing the discriminability of learned representations. Our results on four popular benchmarks show the benefit of our approach where it establishes a new state-of-the-art for FSL.

Broader Impact

This research aims to equip machines with capabilities to learn new concepts using only a few examples. Developing machine learning models which can generalize to a large number of object classes using only a few examples has numerous potential applications with a positive impact on society. Examples include enabling visually impaired individuals to understand the environment around them and enhancing the capabilities of robots being used in healthcare and elderly care facilities. It has the potential to reduce the expensive and laborious data acquisition and annotation effort required to learn models in domains including image classification, retrieval, language modelling and object detection. However, we must be cautious that few-shot learning techniques can be misused by authoritarian government agencies, which can compromise individuals' privacy.