1 Introduction
In recent years, deep learning methods have made great strides on several challenging problems
[he2016deep, szegedy2016rethinking, he2017mask, carreira2017quo, chen2018encoder]. This success can be partially attributed to the availability of large-scale labeled datasets [imagenet_cvpr09, carreira2017quo, zhou2017places, lin2014microsoft]. However, acquiring large amounts of labeled data is infeasible in several real-world problems due to practical constraints such as the rarity of an event or the high cost of manual annotation. Few-shot learning (FSL) targets this problem by learning a model on a set of base classes and studying its adaptability to novel classes with only a few samples (typically 1-5) [pmlrv70finn17a, NIPS2016_6385, snell2017prototypical, sung2018learning]. Notably, this setting differs from transfer and self/semi-supervised learning, which assume the availability of pre-trained models
[sharif2014cnn, zamir2018taskonomy, kornblith2019better] or large amounts of unlabeled data [doersch2015unsupervised, chen2020simple, NIPS2019_8749_MixMatch].

FSL has been predominantly solved using ideas from meta-learning. The two most dominant approaches are optimization-based meta-learning [pmlrv70finn17a, jamal2019task, rusu2018metalearning] and metric-learning based methods [snell2017prototypical, sung2018learning, pmlrv97allen19b]. Both sets of approaches attempt to train a base learner that can be quickly adapted in the presence of a few novel-class examples. However, it has recently been shown in [Raghu2020Rapid] that the quick adaptation of the base learner crucially depends on feature reuse. Other recent works [tian2020rethink, Dhillon2020A, chen2019closerfewshot] have also shown that a baseline feature extractor trained on the whole meta-train set can achieve performance comparable to state-of-the-art meta-learning based methods. This raises an interesting question: how much further can FSL performance be pushed by simply improving the base feature extractor?
To answer this question, we first take a look at the inductive biases in machine learning (ML) algorithms. The optimization of all ML algorithms takes advantage of different inductive biases for hypothesis selection, since the solutions are never unique. The generalization of these algorithms often relies on the effective design of inductive biases, since they encode our a priori preference for a particular set of solutions. For instance, regularization methods such as
$\ell_1$/$\ell_2$ penalties [tibshirani1996regression], dropout [srivastava2014dropout], and early stopping [prechelt1998early] implicitly impose Occam's razor on the optimization process by selecting simpler solutions. Likewise, convolutional neural networks (CNNs) by design impose translation invariance
[battaglia2018relational], which makes the internal embeddings translation equivariant. Inspired by this, several methods [cohen2016group, finzi2020generalizing, dieleman2016exploiting] have attempted to generalize CNNs by imposing equivariance to different geometric transformations so that the internal structure of the data can be modeled more efficiently. On the other hand, methods like [laptev2016ti] aim to be robust against nuisance variations by learning transformation-invariant features. However, such inductive biases do not provide optimal generalization on FSL tasks, and the design of efficient inductive biases for FSL is relatively unexplored.

In this paper, we propose a novel feature learning approach by designing an effective set of inductive biases. We observe that the features required to achieve invariance against input transformations can provide better discrimination, but can hurt generalization. Similarly, features that focus on transformation discrimination are not optimal for class discrimination, but they learn equivariant properties that help capture the structure of the data, leading to better transferability. Therefore, we propose to combine the complementary strengths of both feature types through a multi-task objective that simultaneously seeks to retain both invariant and equivariant features. We argue that learning such generic features encourages the base feature extractor to be more general. We validate this claim through extensive experimentation on multiple benchmark datasets. We also conduct thorough ablation studies to demonstrate that enforcing both equivariance and invariance outperforms enforcing either of these objectives alone (see Fig. 1).
Our main contributions are:


We enforce complementary equivariance and invariance to a general set of geometric transformations to model the underlying structure of the data while remaining discriminative, thereby improving generalization for FSL.

Instead of extensive architectural changes, we propose a simple alternative by defining self-supervised tasks as auxiliary supervision. For equivariance, we introduce a transformation discrimination task, while an instance discrimination task is developed to learn transformation-invariant features.

We demonstrate additional gains with a cross-task knowledge distillation objective that retains the learned invariance and equivariance properties.
2 Related Works
Few-shot Learning: FSL approaches generally belong to the meta-learning family, which either learn a generalizable metric space [snell2017prototypical, koch2015siamese, vinyals2016matching, oreshkin2018tadam] or apply gradient-based updates to obtain a good initialization. In the first class of methods, Siamese networks related a pair of images [koch2015siamese], matching networks applied an LSTM-based context encoder to match query and support set images [vinyals2016matching], and prototypical networks used the distance between the query and the prototype embedding for class assignment [snell2017prototypical]. A task-dependent metric scaling approach to improve FSL was introduced in [oreshkin2018tadam]. The second category uses gradient-based meta-learning methods, which include using a sequence model (e.g., an LSTM) to learn generalizable optimization rules [ravi2017optimization], Model-Agnostic Meta-Learning (MAML) to find a good initialization that can be quickly adapted to new tasks with minimal supervision [pmlrv70finn17a], and Latent Embedding Optimization (LEO), which applies MAML in a low-dimensional space from which high-dimensional parameters can be generated [rusu2018metalearning]. A few recent efforts, e.g., ProtoMAML [Triantafillou2020MetaDataset:], combine the complementary strengths of metric-learning and gradient-based meta-learning methods.
Inductive Biases in CNNs: Inductive biases reflect our prior knowledge regarding a particular problem. State-of-the-art CNNs are built on such design choices, ranging from the convolution operator (e.g., weight sharing and translational equivariance) [lecun1995convolutional] and the pooling operator (e.g., local neighbourhood relevance) [Cohen2017InductiveBO] to regularization mechanisms (e.g., sparsity with an $\ell_1$ regularizer) [khan2018guide] and loss functions (e.g., max-margin boundaries) [hayat2019gaussian]. Similarly, recurrent architectures and attention mechanisms are biased towards preserving contextual information and being invariant to time translation [battaglia2018relational]. A number of approaches have been designed to achieve invariance to nuisances such as natural perturbations [hendrycks2019benchmarking, tramer2019adversarial], viewpoint changes [milford2015sequence], and image transformations [cubuk2018autoaugment, buslaev2020albumentations]. On the other hand, equivariant representations have also been investigated to retain knowledge regarding group actions [cohen2016group, qi2020learning, sabour2017dynamic, lenssen2018group], thereby maintaining meaningful structure in the representations. In this work, we advocate that representations which simultaneously achieve invariance and equivariance can be useful for generalization to new tasks with limited data.

Self-supervised Learning for FSL:
Our self-supervised loss is inspired by recent progress in self-supervised learning (SSL), where proxy tasks are defined to learn transferable representations without any manual annotations
[rajasegaran2020self]. The pretext tasks include colorization
[larsson2016learning, zhang2016colorful], inpainting [pathak2016context], relative patch location [doersch2015unsupervised, noroozi2016unsupervised], and predicting the amount of rotation applied [gidaris2018unsupervised]. Recently, the potential of SSL for FSL was explored in [gidaris2019boosting, su2020does]. In [gidaris2019boosting], a parallel branch with a rotation prediction task was added to help learn generalizable features. Su et al. [su2020does] also used rotation and patch permutation as auxiliary tasks and concluded that SSL is more effective in low-shot regimes and under significant domain shifts. A recent approach employed SimCLR-style [chen2020simple] contrastive learning with augmented pairs to learn improved representations in either unsupervised pre-training [medina2020self] or episodic training [doersch2020crosstransformers] for FSL.

In contrast to the existing SSL approaches for FSL, we propose to jointly optimize a complementary pair of pretext tasks that lead to better generalization. Our novel distillation objective acquires knowledge from the classification as well as the proxy task heads and demonstrates further performance improvements. We present our approach next.
3 Our Approach
We first describe the problem setting and the baseline training approach and then present our proposed approach.
3.1 Problem Formulation
Few-shot learning (FSL) operates in two phases: first, a model is trained on a set of base classes, and then, at inference, a new set of few-shot classes is received. We define the base training set as $\mathcal{D}_{base} = \{(\mathbf{x}_i, y_i)\}$, where
$\mathbf{x}_i$ is an image, and the one-hot encoded label $y_i$
can belong to one of $C_{base}$ base classes. At inference, a dataset of few-shot classes is presented for learning, such that the label belongs to one of $C_{novel}$ novel classes, each with a total of $K$ examples ($K$ typically ranges between 1 and 5). The evaluation setting for few-shot classes is denoted as $N$-way, $K$-shot. Importantly, the base and few-shot classes belong to totally disjoint sets.

For solving the FSL task, most meta-learning methods [pmlrv70finn17a, snell2017prototypical, NIPS2016_6385] have leveraged an episodic training scheme. An episode consists of a small train set and test set. The examples for the train and test set of an episode are sampled from the same distribution, i.e., from the same subset of meta-training classes. Meta-learning methods optimize the parameters of the base learner by solving a collection of these episodes. The main motivation is that the evaluation conditions should be emulated in the base training stage. However, following recent works [tian2020rethink, Dhillon2020A, chen2019closerfewshot], we do not use an episodic training scheme, which allows us to train a single generalizable model that can be efficiently used for any-way, any-shot settings without retraining. Specifically, we train our base learner on the whole base training set in a supervised manner.
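For concreteness, episodic sampling can be sketched as follows. This is a minimal NumPy illustration over a hypothetical pool of 20 base classes with 50 images each; meta-learning methods train on many such episodes, whereas our baseline trains once on the whole base set.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical labeled pool standing in for a meta-train set:
# 20 base classes with 50 images each (indices into an image array).
labels = np.repeat(np.arange(20), 50)

def sample_episode(labels, n_way=5, k_shot=1, n_query=15):
    """Sample one n_way-way, k_shot-shot episode: support/query index sets."""
    classes = rng.choice(np.unique(labels), size=n_way, replace=False)
    support, query = [], []
    for c in classes:
        idx = rng.permutation(np.flatnonzero(labels == c))
        support.extend(idx[:k_shot])                # k_shot images per class
        query.extend(idx[k_shot:k_shot + n_query])  # disjoint query images
    return np.array(support), np.array(query)

support, query = sample_episode(labels)
# 5 classes x 1 support image and 5 x 15 query images from the same classes.
```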
Let us assume our base learner for the FSL task is a neural network,
$f_\theta$, parameterized by $\theta$. The role of this base learner is to extract good feature embeddings that can generalize to novel classes. The base learner projects an input image $\mathbf{x}$ into the embedding space $\mathbb{R}^{d}$, such that $f_\theta(\mathbf{x}) \in \mathbb{R}^{d}$. Now, to optimize the parameters of the base learner, we need a classifier to project the extracted embeddings into the label space. To this end, we introduce a classifier function,
$g_\psi$, with parameters $\psi$, that projects the embeddings into the label space, i.e., $g_\psi : \mathbb{R}^{d} \rightarrow \mathbb{R}^{C_{base}}$. We jointly optimize the parameters of both $f_\theta$ and $g_\psi$ by minimizing the cross-entropy loss on the whole base training set $\mathcal{D}_{base}$. The classification loss is given by

$$\mathcal{L}_{ce} = \mathbb{E}_{(\mathbf{x}, y) \sim \mathcal{D}_{base}}\big[-\log\, \mathrm{softmax}\big(g_\psi(f_\theta(\mathbf{x}))\big)_y\big].$$
To regularize the parameters of both sub-networks, we add a regularization term. Hence, the learning objective for our baseline training algorithm becomes:

$$\mathcal{L}_{base} = \mathcal{L}_{ce} + \mathcal{R}(\theta, \psi) \qquad (1)$$

Here, $\mathcal{R}(\theta, \psi)$ is an $\ell_2$ regularization term for the parameters $\theta$ and $\psi$. Next, we present our inductive objectives.
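The baseline objective in Eq. (1) can be illustrated with a minimal NumPy sketch, in which plain linear maps stand in for the base feature extractor and the classifier; all shapes and the weight-decay coefficient are illustrative, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: the feature extractor and the classifier are plain matrices.
d, n_classes, batch = 64, 5, 8
theta = rng.normal(scale=0.1, size=(3 * 32 * 32, d))  # "backbone" weights
psi = rng.normal(scale=0.1, size=(d, n_classes))      # classifier weights

x = rng.normal(size=(batch, 3 * 32 * 32))             # batch of flattened images
y = rng.integers(0, n_classes, size=batch)            # integer class labels

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def baseline_loss(theta, psi, x, y, weight_decay=5e-4):
    emb = x @ theta                                   # embeddings from the extractor
    probs = softmax(emb @ psi)                        # class probabilities
    ce = -np.log(probs[np.arange(len(y)), y]).mean()  # cross-entropy term
    reg = weight_decay * ((theta ** 2).sum() + (psi ** 2).sum())  # l2 regularizer
    return ce + reg

loss = baseline_loss(theta, psi, x, y)
```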
3.2 Injecting Inductive Biases through SSL
We propose to enforce equivariance and invariance to a general set of geometric transformations by simply performing self-supervised learning (SSL). Self-supervision is particularly useful for learning general features without accessing semantic labels. For representation learning, self-supervised methods generally aim to either achieve equivariance to some input transformations or learn to discriminate instances by making the representations invariant. To the best of our knowledge, we are the first to explore simultaneous equivariance and invariance to a general set of geometric transformations in the self-supervised literature.
The transformation set can be obtained from a family of geometric transformations, $\mathcal{T}$, with $t \in \mathcal{T}$. Here, $\mathcal{T}$ can be interpreted as a family of geometric transformations such as Euclidean, similarity, affine, or projective transformations. All of these geometric transformations can be represented by a $3 \times 3$ matrix with varying degrees of freedom. However, enforcing equivariance and invariance over a continuous space of geometric transformations, $\mathcal{T}$, is difficult and may even lead to sub-optimal solutions. To overcome this issue, in this work we quantize the complete space of affine transformations. We approximate $\mathcal{T}$ by dividing it into a discrete set of $M$ transformations. Here, $M$ can be selected based on the nature of the data and the computation budget.

For training, we generate transformed copies of an input image $\mathbf{x}_i$ by applying all $M$ transformations. Then we combine all of these transformed images into a single tensor, $\tilde{\mathbf{x}}_i = [\mathbf{x}_i^{t_1}, \ldots, \mathbf{x}_i^{t_M}]$. Here, $\mathbf{x}_i^{t_j}$ is the input image transformed through the $j$-th transformation, $t_j$ (the subscript $i$ is dropped in the subsequent discussion for clarity). We send this composite input to the network and optimize for both equivariance and invariance. The training is performed in a multi-task fashion. In addition to the classification head, which is needed for the baseline supervised training, two other heads are added on top of the base learner, as shown in Figure 2. One of these heads is used for enforcing equivariance, and the other for enforcing invariance. This multi-task training scheme ensures that the base learner retains both transformation-equivariant and transformation-invariant features in the output embedding. We explain each component of our inductive loss below.

3.2.1 Enforcing Equivariance
As discussed above, equivariant features help encode the inherent structure of the data, which improves the generalization of features to new tasks. To enforce equivariance to the set of $M$ quantized transformations, we introduce an MLP head $h_{\phi_e}$ with parameters $\phi_e$. The role of $h_{\phi_e}$ is to project the output embeddings from the base learner into an equivariant space, i.e., $h_{\phi_e} : \mathbb{R}^{d} \rightarrow \mathbb{R}^{M}$.

In order to train the network, we create proxy labels without any manual supervision. For a specific transformation $t_j$, an $M$-dimensional one-hot encoded vector $\mathbf{p}_j$ (such that $\mathbf{p}_j[j] = 1$) is used to represent the label for $\mathbf{x}^{t_j}$. Once the proxy labels are assigned, training is performed in a supervised manner with the cross-entropy loss, as follows:

$$\mathcal{L}_{eq} = \frac{1}{M} \sum_{j=1}^{M} \mathrm{CE}\Big(\mathrm{softmax}\big(h_{\phi_e}(f_\theta(\mathbf{x}^{t_j}))\big),\, \mathbf{p}_j\Big) \qquad (2)$$
This supervised training with proxy labels in the equivariant space ensures that the output embedding retains transformation-equivariant features.
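The transformation-discrimination task can be sketched as follows. This is a minimal NumPy illustration that uses only four rotations as a stand-in for the full set of M transformations, and a toy linear map in place of the actual MLP head on top of the base learner.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical quantized transformation set: 4 rotations as a toy stand-in.
def rotate(img, k):
    return np.rot90(img, k, axes=(0, 1))

M = 4
img = rng.normal(size=(32, 32, 3))

# Composite input: all M transformed copies of one image, each paired with a
# proxy label j identifying which transformation was applied.
transformed = np.stack([rotate(img, j) for j in range(M)])  # (M, 32, 32, 3)
proxy_labels = np.arange(M)                                  # j-th copy -> label j

# Toy "head" scoring each copy against the M transformation classes.
W = rng.normal(scale=0.01, size=(32 * 32 * 3, M))
logits = transformed.reshape(M, -1) @ W

def cross_entropy(logits, labels):
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

loss_eq = cross_entropy(logits, proxy_labels)  # equivariance (proxy) loss
```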
3.2.2 Enforcing Invariance
While equivariant representations are important for encoding the structure in the data, they may not be optimal for class discrimination. This is because the transformations we consider are nuisance variations that do not change the image class; therefore, a good feature extractor should also encode representations that are independent of these input variations. To enforce invariance to the set of $M$ quantized transformations, we introduce another MLP head $h_{\phi_v}$ with parameters $\phi_v$. The role of $h_{\phi_v}$ is to project the output embeddings from the base learner into an invariant space, i.e., $h_{\phi_v} : \mathbb{R}^{d} \rightarrow \mathbb{R}^{d_v}$, where $d_v$ is the dimension of the invariant embedding.
To optimize for invariance, we leverage a contrastive loss [hadsell2006dimensionality] for instance discrimination. We enforce invariance by maximizing the similarity between an embedding $\mathbf{z}^{t_j}$ corresponding to a transformed image (after undergoing transformation $t_j$) and the reference embedding $\mathbf{z}$ (the embedding of the original image without any transformation applied). Importantly, we note that selecting negatives within a batch is not sufficient to obtain discriminant representations [wu2018unsupervised, misra2020self]. We therefore employ a memory bank in our contrastive loss to sample more negative samples without arbitrarily increasing the batch size. Further, the memory bank allows a stable convergence behavior [misra2020self]. Our learning objective is as follows:

$$\mathcal{L}_{inv} = -\frac{1}{M} \sum_{j=1}^{M} \log \ell\big(\mathbf{z}^{t_j}, \bar{\mathbf{z}}\big) \qquad (3)$$

where $j$ denotes the transformation index, $\bar{\mathbf{z}}$ represents a previous copy of the reference held in the memory bank, and the function $\ell$ is defined as

$$\ell\big(\mathbf{z}^{t_j}, \bar{\mathbf{z}}\big) = \frac{\exp\big(\mathrm{sim}(\mathbf{z}^{t_j}, \bar{\mathbf{z}})/\tau\big)}{\exp\big(\mathrm{sim}(\mathbf{z}^{t_j}, \bar{\mathbf{z}})/\tau\big) + \sum_{\mathbf{z}^{-} \in \mathcal{N}} \exp\big(\mathrm{sim}(\mathbf{z}^{t_j}, \mathbf{z}^{-})/\tau\big)}.$$

Here, $\mathrm{sim}(\cdot, \cdot)$ is a similarity function, $\tau$ is the temperature, and $\mathcal{N}$ is the set of negative samples drawn from the memory bank for a particular mini-batch. Note that we also maximize the similarity between the reference embedding and its past representation, which helps stabilize the learning.
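The memory-bank contrastive objective can be sketched as follows. This is a NumPy illustration with randomly generated, hypothetical embeddings; the dot product of l2-normalized vectors (cosine similarity) stands in for the similarity function.

```python
import numpy as np

rng = np.random.default_rng(2)

def l2_normalize(v, axis=-1):
    return v / np.linalg.norm(v, axis=axis, keepdims=True)

d_v, n_memory, tau = 64, 512, 1.0

# Hypothetical embeddings: z_t from the invariant head for a transformed image,
# z_ref_mem is the stored (past) embedding of the same untransformed image,
# and the memory bank holds embeddings of other instances (negatives).
z_t = l2_normalize(rng.normal(size=d_v))
z_ref_mem = l2_normalize(rng.normal(size=d_v))
memory_bank = l2_normalize(rng.normal(size=(n_memory, d_v)), axis=1)

def contrastive_loss(z, z_pos, negatives, tau=1.0):
    pos = np.exp(z @ z_pos / tau)            # similarity to the reference copy
    neg = np.exp(negatives @ z / tau).sum()  # similarities to bank negatives
    return -np.log(pos / (pos + neg))

# Sample negatives from the memory bank rather than the current batch.
neg_idx = rng.choice(n_memory, size=256, replace=False)
loss_inv = contrastive_loss(z_t, z_ref_mem, memory_bank[neg_idx], tau)
```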
3.2.3 Multihead Distillation
Once the invariant and equivariant representations are learned by our model, we use self-distillation to train a new model using the outputs of the previous model as anchor points (Fig. 2). Note that in typical knowledge distillation [44873], information is transferred from a larger model (teacher) to a smaller one (student) by matching their softened outputs. In contrast, self-distillation [DBLP:conf/icml/FurlanelloLTIA18] matches the outputs of identical models, where the smooth predictions encode inter-label dependencies, thereby helping the model learn better representations.
In our case, simple knowledge distillation by pairing the logits [tian2020rethink] would not ensure the transfer of the invariant and equivariant representations learned by the previous model version. Therefore, we extend the idea of logit-based knowledge distillation and apply it to our invariant and equivariant embeddings. Specifically, in parallel to minimizing the Kullback-Leibler (KL) divergence between the soft outputs of the supervised classifier heads, we also minimize the KL divergence between the outputs of the equivariant heads. Since the output of our invariant head is not a probability distribution, we minimize an $\ell_2$ loss for distilling the knowledge at this head. The overall learning objective for knowledge distillation is as follows:

$$\mathcal{L}_{kd} = \mathrm{KL}\big(g_\psi^{S},\, g_\psi^{T}\big) + \mathrm{KL}\big(h_{\phi_e}^{S},\, h_{\phi_e}^{T}\big) + \big\| h_{\phi_v}^{S} - h_{\phi_v}^{T} \big\|_2^2 \qquad (4)$$

Here, $T$ and $S$ denote the teacher and student networks for distillation, respectively.
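The multi-head distillation objective can be sketched as follows. This is a NumPy illustration on randomly generated, hypothetical teacher/student head outputs for a single image; the distillation temperature of 4.0 follows the value reported in the implementation details.

```python
import numpy as np

rng = np.random.default_rng(3)

def softmax(z, tau=1.0):
    z = z / tau
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def kl_div(p, q):
    return float((p * np.log(p / q)).sum())

# Hypothetical head outputs for one image: classifier logits, equivariance-head
# logits (over M transformations), and invariant embeddings.
n_classes, M, d_v, tau = 5, 16, 64, 4.0
cls_T, cls_S = rng.normal(size=n_classes), rng.normal(size=n_classes)
eq_T, eq_S = rng.normal(size=M), rng.normal(size=M)
inv_T, inv_S = rng.normal(size=d_v), rng.normal(size=d_v)

# Multi-head distillation: KL on softened classifier and equivariance outputs,
# l2 on the (non-probabilistic) invariant embeddings.
loss_kd = (
    kl_div(softmax(cls_T, tau), softmax(cls_S, tau))
    + kl_div(softmax(eq_T, tau), softmax(eq_S, tau))
    + ((inv_T - inv_S) ** 2).sum()
)
```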
3.2.4 Overall Objective
Finally, we obtain the resultant loss for injecting the desired inductive biases by combining the equivariant $\mathcal{L}_{eq}$, invariant $\mathcal{L}_{inv}$, and multi-head distillation $\mathcal{L}_{kd}$ losses:

$$\mathcal{L}_{ind} = \mathcal{L}_{eq} + \mathcal{L}_{inv} + \mathcal{L}_{kd}$$

The overall loss is simply a combination of the inductive and baseline objectives:

$$\mathcal{L} = \mathcal{L}_{base} + \lambda\, \mathcal{L}_{ind} \qquad (5)$$
3.3 Few-Shot Evaluation
For evaluation, we test our base learner by sampling FSL tasks from a held-out test set comprising images from novel classes. Each FSL task contains a support set $\mathcal{S}$ and a corresponding query set $\mathcal{Q}$; both contain images from the same subset of test classes. Using the base learner, we obtain embeddings for the images of both $\mathcal{S}$ and $\mathcal{Q}$. Following [tian2020rethink], we train a simple logistic regression classifier on the image embeddings and the corresponding labels from the support set $\mathcal{S}$, and use this linear classifier to infer the labels of the query embeddings.

4 Experimental Evaluation
Datasets:
We evaluate our method on five popular benchmark FSL datasets. Two of these datasets are subsets of the CIFAR-100 dataset: CIFAR-FS
[bertinetto2018metalearning] and FC100 [oreshkin2018tadam]. Another two are derivatives of the ImageNet
[imagenet_cvpr09] dataset: miniImageNet [NIPS2016_6385] and tieredImageNet [ren2018meta]. The CIFAR-FS dataset is constructed by randomly splitting the 100 classes of the CIFAR-100 dataset into 64, 16, and 20 classes for the train, validation, and test splits. The FC100 dataset makes the FSL task more challenging by making the splits more diverse; the FC100 train, validation, and test splits contain 60, 20, and 20 classes, respectively. Following [Ravi2017OptimizationAA], we use 64, 16, and 20 classes of the miniImageNet dataset for training, validation, and testing. The tieredImageNet dataset contains 608 ImageNet classes grouped into 34 high-level categories; we use 20/351, 6/97, and 8/160 categories/classes for training, validation, and testing. We also evaluate our method on the newly proposed Meta-Dataset
[Triantafillou2020MetaDataset:], which contains 10 diverse datasets to make the FSL task more challenging and closer to realistic classification problems.



Methods  Backbone  1-Shot  5-Shot 
MAML[pmlrv70finn17a] 
32323232  
ProtoNet[snell2017prototypical]  64646464  
Relation Net[sung2018learning]  6496128256  
R2D2[bertinetto2018metalearning]  96192384512  
ShotFree[ravichandran2019few]  ResNet12  
TEWAM[qiao2019transductive]  ResNet12  
ProtoNet[snell2017prototypical]  ResNet12  
MetaOptNet[lee2019meta]  ResNet12  
Boosting[gidaris2019boosting]  WRN2810  
Finetuning[Dhillon2020A]  WRN2810  
DSNMR[simon2020adaptive]  ResNet12  
MABAS[10.1007/9783030584528_35]  ResNet12  
RFSSimple[tian2020rethink]  ResNet12  
RFSDistill[tian2020rethink]  ResNet12  
Ours 
ResNet12  
OursDistill  ResNet12  

Average 5-way few-shot classification accuracy with 95% confidence intervals on the
CIFAR-FS dataset; trained on both train and validation sets. The top two results are shown in red and blue.



Methods  Backbone  1-Shot  5-Shot 
ProtoNet[snell2017prototypical] 
64646464  
ProtoNet[snell2017prototypical]  ResNet12  
TADAM[oreshkin2018tadam]  ResNet12  
MetaOptNet[lee2019meta]  ResNet12  
MTL[sun2019meta]  ResNet12  
Finetuning[Dhillon2020A]  WRN2810  
MABAS[10.1007/9783030584528_35]  ResNet12  
RFSSimple[tian2020rethink]  ResNet12  
RFSDistill[tian2020rethink]  ResNet12  
Ours  ResNet12  
OursDistill  ResNet12  




Methods  Backbone  1-Shot  5-Shot 
MAML[pmlrv70finn17a] 
32323232  
Matching Net [NIPS2016_6385]  64646464  
ProtoNet[snell2017prototypical]  64646464  
Relation Net[sung2018learning]  6496128256  
R2D2[bertinetto2018metalearning]  96192384512  
SNAIL[mishra2018a]  ResNet12  
AdaResNet[pmlrv80munkhdalai18a]  ResNet12  
TADAM[oreshkin2018tadam]  ResNet12  
ShotFree[ravichandran2019few]  ResNet12  
TEWAM[qiao2019transductive]  ResNet12  
MTL[sun2019meta]  ResNet12  
MetaOptNet[lee2019meta]  ResNet12  
Boosting[gidaris2019boosting]  WRN2810  
Finetuning[Dhillon2020A]  WRN2810  
LEOtrainval[rusu2018metalearning]  WRN2810  
Deep DTN[Chen2020DiversityTN]  ResNet12  
AFHN[li2020adversarial]  ResNet18  
AWGIM[guo2020attentive]  WRN2810  
DSNMR[simon2020adaptive]  ResNet12  
MABAS[10.1007/9783030584528_35]  ResNet12  
RFSSimple[tian2020rethink]  ResNet12  
RFSDistill[tian2020rethink]  ResNet12  
Ours  ResNet12  
OursDistill  ResNet12  

Implementation Details: Following [tian2020rethink, mishra2018a, oreshkin2018tadam, lee2019meta], we use a ResNet-12 network as our base learner for experiments on the CIFAR-FS, FC100, miniImageNet, and tieredImageNet datasets. Following [tian2020rethink, lee2019meta], we also apply the DropBlock [ghiasi2018dropblock] regularizer to our ResNet-12 base learner. For Meta-Dataset experiments, we use a ResNet-18 [he2016deep] network as our base learner to be consistent with [tian2020rethink]. We instantiate both the equivariant and invariant embedding heads with an MLP consisting of a single hidden layer. The classifier is instantiated with a single linear layer.
We use the SGD optimizer with a momentum of 0.9 in all experiments. For the CIFAR-FS, FC100, miniImageNet, and tieredImageNet datasets, we set the initial learning rate to 0.05 and apply weight decay. For experiments on the CIFAR-FS, FC100, and miniImageNet datasets, we train for 65 epochs; the learning rate is decayed by a factor of 10 after the first 60 epochs. We train for 60 epochs on the tieredImageNet dataset; the learning rate is decayed by a factor of 10 three times after the first 30 epochs. For Meta-Dataset experiments, we set the initial learning rate to 0.1 and apply weight decay. We train our method for 90 epochs and decay the learning rate by a factor of 10 every 30 epochs. We use a batch size of 64 in all of our experiments except on Meta-Dataset, where the batch size is set to 256 following [tian2020rethink]. For Meta-Dataset experiments, we use standard data augmentation, which includes random horizontal flips and random resized crops. For all other datasets, we use random crops, color jittering, and random horizontal flips for data augmentation, following [tian2020rethink, lee2019meta]. Consistent with [tian2020rethink], we use a temperature coefficient of 4.0 for our knowledge distillation experiments. For all datasets, we perform one stage of distillation, except for ILSVRC training on Meta-Dataset, where we do not use distillation. We sample 600 FSL tasks to report our scores.

For our geometric transformations, we sample from the complete space of similarity transformations and use four rotations {0°, 90°, 180°, 270°}, two scalings {0.67, 1.0}, and three aspect-ratio changes {0.67, 1.0, 1.33}. These geometric transformations can be applied in any combination. For all of our experiments, we set the total number of applied transformations $M$ to 16. Additional details and experiments with more geometric transformations are included in the supplementary material. For the contrastive loss, we use a memory bank that stores 64-dimensional embeddings of instances; we sample 6400 negative samples from the memory bank for each mini-batch and set the temperature $\tau$ to 1.0.
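The quantized transformation grid described above can be enumerated as follows. This is a sketch only: which 16 of the 24 rotation-scale-aspect combinations are actually used is not specified here, so the snippet builds the full grid.

```python
from itertools import product

# Quantized similarity transformations: 4 rotations x 2 scales x 3 aspect
# ratios = 24 candidate combinations; the experiments keep M = 16 of these.
rotations = [0, 90, 180, 270]
scales = [0.67, 1.0]
aspect_ratios = [0.67, 1.0, 1.33]

transformations = list(product(rotations, scales, aspect_ratios))
print(len(transformations))  # 24 candidate (rotation, scale, aspect-ratio) triples
```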
4.1 Results



Methods  Backbone  1-Shot  5-Shot 
MAML[pmlrv70finn17a] 
32323232  
ProtoNet[snell2017prototypical]  64646464  
Relation Net[sung2018learning]  6496128256  
ShotFree[ravichandran2019few]  ResNet12  
MetaOptNet[lee2019meta]  ResNet12  
Boosting[gidaris2019boosting]  WRN2810  
Finetuning[Dhillon2020A]  WRN2810  
LEOtrainval[rusu2018metalearning]  WRN2810  
AWGIM[guo2020attentive]  WRN2810  
DSNMR[simon2020adaptive]  ResNet12  
RFSSimple[tian2020rethink]  ResNet12  
RFSDistill[tian2020rethink]  ResNet12  
Ours 
ResNet12  
OursDistill  ResNet12  

We present our results on five popular benchmark FSL datasets in Tables 1-5, which demonstrate that even without multi-head distillation our proposed method consistently outperforms the current state-of-the-art (SOTA) FSL methods on both 5-way 1-shot and 5-way 5-shot tasks. By virtue of our novel representation learning approach, which retains both transformation-invariant and transformation-equivariant features in the learned embeddings, our method improves over the baseline RFS-Simple [tian2020rethink] across all datasets by 2-5% for both 1-shot and 5-shot tasks. More specifically, our method outperforms the current best results on the CIFAR-FS dataset (Table 1) by 1.3% on the 1-shot task, whereas for the 5-shot task it improves the score by 2.8%. Notably, unlike [Dhillon2020A], which achieves the current best results on the CIFAR-FS 1-shot task, we do not perform any transductive fine-tuning. For the FC100 dataset (Table 2), we observe an even bigger improvement: 2.7% and 4.4% for 1-shot and 5-shot, respectively. We see similar trends on miniImageNet and tieredImageNet (Tables 3 and 4), where we consistently improve over the current SOTA methods by 0.7-2.2%.
For the Meta-Dataset [Triantafillou2020MetaDataset:], we train our model on the ILSVRC train split and test on 10 diverse datasets. Our results in Table 5 demonstrate that our method outperforms foProtoMAML [Triantafillou2020MetaDataset:] across all 10 datasets. Even without multi-head distillation, we outperform both the simple and distilled versions of the RFS method on 6 out of 10 datasets. Overall, we compare favorably against RFS, achieving a new SOTA result on the challenging Meta-Dataset.



Dataset  foProtoMAML  RFS  Ours  
LRSimple  LRDistill  
ILSVRC 

Omniglot  
Aircraft  
Birds  
Textures  
Quick Draw  
Fungi  
VGG Flower  
Traffic Signs  
MSCOCO  
Mean Accuracy  

4.2 Ablations



Method  miniImageNet, 5-Way  CIFAR-FS, 5-Way  FC100, 5-Way  
1-Shot  5-Shot  1-Shot  5-Shot  1-Shot  5-Shot  
Baseline Training 

Ours with only Invariance  
Ours with only Equivariance  
Ours with Equivariance and Invariance (w/o KD)  
Ours with Supervised KD  
Ours Full  

To study the contribution of different components of our method, we conduct a thorough ablation study on three benchmark FSL datasets: miniImageNet, CIFAR-FS, and FC100 (Table 6). On these three datasets, our baseline supervised training achieves 62.02%, 71.50%, and 42.60% average accuracy on the 5-way 1-shot task, respectively, which is the same as RFS-Simple [tian2020rethink]. Enforcing invariance yields improvements of 2.62%, 2%, and 3.5%, respectively. Likewise, enforcing equivariance gives 4.07%, 4.87%, and 4.13% improvements over the baseline. We obtain even bigger improvements by simultaneously optimizing for both equivariance and invariance, achieving gains of 4.8%, 5.33%, and 4.78% over the baseline supervised training. Moreover, joint training gives a 1.3-3.3% improvement over invariance-only training and a 0.5-0.7% improvement over equivariance-only training. We observe similar trends for the 5-way 5-shot task. This consistent improvement across all datasets for both tasks empirically validates our claim that joint optimization for both equivariance and invariance is beneficial for FSL tasks. Our ablation study also shows that multi-head distillation improves performance over standard logit-level supervised distillation across all datasets.
Effect of the number of Transformations: To investigate the effect of the total number of applied transformations, we perform an ablation study on the CIFAR-FS validation set by varying the number of transformations, $M$. We present the results in Table 7, which demonstrates that the performance of our method initially improves with increasing $M$. However, the performance starts to saturate beyond a particular point. We hypothesize that the performance eventually decreases with an increasing number of transformations since discriminating a higher number of transformations is more difficult, and the model spends more of its representational capacity on solving this harder task. A similar trend is observed in [gidaris2018unsupervised], where increasing the number of recognizable rotations does not lead to better performance. Based on the results in Table 7, we set the value of $M$ to 16 for all of our experiments and do not tune it from dataset to dataset.



Description  1-Shot  5-Shot  
3  Aspect-Ratio  
4  Rotation  
8  Rotation, Scale  
12  Aspect-Ratio, Rotation  
16  Aspect-Ratio, Rotation, Scale  
20  Aspect-Ratio, Rotation, Scale  

4.3 Analysis
We present a t-SNE visualization of the output embeddings from the base learner for the test images of miniImageNet to demonstrate the effectiveness of our method (see Fig. 3). We observe that the base learner trained in a supervised manner retains good class discrimination even for unseen test classes. However, as evident in Fig. 3, the class boundaries are not precise and compact. Enforcing invariance on top of the base learner leads to more compact class boundaries; however, the sample embeddings of different classes remain relatively close to one another. On the other hand, enforcing equivariance leads to class representations that are well spread out, since it retains transformation-equivariant information in the embedding space. Finally, our proposed method takes advantage of both of these complementary properties and generates embeddings that lead to more compact clusters and discriminative class boundaries.
4.4 Alternate Self-Supervision Losses



Method  1-Shot  5-Shot 
Baseline Training 

Baseline + Jigsaw Puzzle [noroozi2016unsupervised]  
Baseline + Location Pred [sun2019unsupervised]  
Baseline + Context Pred [doersch2015unsupervised]  
Baseline + Rotation [gidaris2018unsupervised]  
Ours (W/O KD)  

In Table 8, to further analyze the performance improvement of our method, we conduct a set of experiments where commonly used self-supervised objectives, such as solving jigsaw puzzles [noroozi2016unsupervised], patch location prediction [sun2019unsupervised], context prediction [doersch2015unsupervised], and rotation classification [gidaris2018unsupervised], are added on top of the base learner as auxiliary tasks. We find that our proposed method, which learns representations retaining both transformation-invariant and transformation-equivariant information, outperforms all of these SSL tasks by a good margin. We also observe that the patch-based SSL tasks [noroozi2016unsupervised, sun2019unsupervised, doersch2015unsupervised] generally underperform compared to SSL tasks that alter the global statistics of the image while maintaining its local statistics; this conclusion is in line with the experimental results of [gidaris2019boosting].
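As a concrete example of one such auxiliary objective, the data generation for the rotation-classification baseline of [gidaris2018unsupervised] compared against in Table 8 can be sketched as below; each image yields four rotated copies, and an auxiliary head predicts which rotation was applied. The array shapes and the helper name are illustrative assumptions.

```python
# Sketch: build the (inputs, labels) pairs for the rotation
# self-supervision auxiliary task. Label k means a rotation of k * 90°.
import numpy as np

def make_rotation_task(image: np.ndarray):
    """Return the 4 rotated copies of `image` (H, W, C) and labels 0-3."""
    copies = np.stack([np.rot90(image, k=k, axes=(0, 1)) for k in range(4)])
    labels = np.arange(4)
    return copies, labels

if __name__ == "__main__":
    img = np.zeros((32, 32, 3), dtype=np.float32)
    img[0, 0, 0] = 1.0  # mark one corner so rotations are distinguishable
    copies, labels = make_rotation_task(img)
    print(copies.shape, labels.tolist())  # (4, 32, 32, 3) [0, 1, 2, 3]
```

A classification head on top of the backbone is then trained with cross-entropy on these labels, alongside the supervised loss.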
5 Conclusion
In this work, we explored a set of inductive biases that help us learn highly discriminative and transferable representations for FSL. Specifically, we showed that simultaneously learning representations that are equivariant and invariant to a set of generic transformations retains a complementary set of features that work well for novel classes. We also designed a novel multi-head knowledge distillation objective which delivers additional gains. We conducted extensive ablations to empirically validate our claim that joint optimization for invariance and equivariance leads to more generic and transferable features. We obtained new state-of-the-art results on four popular benchmark FSL datasets as well as on the newly proposed, challenging Meta-Dataset.
References
Appendix A Supplementary Materials Overview
In the supplementary materials we include the following: additional details about the applied geometric transformations (Section B), additional results with transformations sampled from the complete space of affine transformations (Section C), an ablation study on the coefficient of the inductive loss (Section D), an ablation study on the knowledge distillation temperature (Section E), the effect of successive self-knowledge distillation (Section F), additional results on Meta-Dataset with multi-head distillation (Section G), and the effect of enforcing invariance and equivariance for supervised classification (Section H).
Appendix B Geometric Transformations
For our geometric transformations, we sample from the space of similarity transformations and use four rotations: {0°, 90°, 180°, 270°}, two scalings: {0.67, 1.0}, and three aspect ratios: {0.67, 1.0, 1.33}. Different combinations of these transformations lead to different totals of applied transformations. An ablation study on this total is included in Section 4.2 of the main paper. In Table 9 we include the complete description of the different transformation sets that we use in our experiments.



Number of Transformations  Description
3   AR:{0.67, 1.0, 1.33}
4   ROT:{0°, 90°, 180°, 270°}
8   ROT:{0°, 90°, 180°, 270°} × S:{0.67, 1.0}
12  AR:{0.67, 1.0, 1.33} × ROT:{0°, 90°, 180°, 270°}
16  (AR:{0.67, 1.0, 1.33} × ROT:{0°, 90°, 180°, 270°}) ∪ (ROT:{0°, 90°, 180°, 270°} × S:{0.67})
20  (AR:{0.67, 1.0, 1.33} × ROT:{0°, 90°, 180°, 270°}) ∪ (ROT:{0°, 90°, 180°, 270°} × S:{0.67} × AR:{0.67, 1.33})
24  AR:{0.67, 1.0, 1.33} × ROT:{0°, 90°, 180°, 270°} × S:{0.67, 1.0}
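The set sizes in Table 9 follow directly from the Cartesian products (and unions) of the listed aspect-ratio (AR), rotation (ROT), and scale (S) values; a small sketch enumerating them, with our own tags to distinguish the union members:

```python
# Sketch: enumerate the transformation sets of Table 9 and confirm their
# sizes (3, 4, 8, 12, 16, 20, 24). The union members are tagged so that
# identical (AR, ROT) tuples from different branches remain distinct.
from itertools import product

AR = (0.67, 1.0, 1.33)
ROT = (0, 90, 180, 270)
S = (0.67, 1.0)

SETS = {
    3: list(product(AR)),
    4: list(product(ROT)),
    8: list(product(ROT, S)),
    12: list(product(AR, ROT)),
    16: [("ar-rot",) + t for t in product(AR, ROT)]
        + [("rot-s",) + t for t in product(ROT, (0.67,))],
    20: [("ar-rot",) + t for t in product(AR, ROT)]
        + [("rot-s-ar",) + t for t in product(ROT, (0.67,), (0.67, 1.33))],
    24: list(product(AR, ROT, S)),
}

if __name__ == "__main__":
    for n, transforms in SETS.items():
        print(n, len(transforms))  # each line prints n twice
```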

Appendix C Additional Results with Affine Transformations
We perform a set of experiments where the objective is to sample geometric transformations from the complete space of affine transformations. To this end, we quantize the affine transformation space according to Table 10, which leads to 972 distinct geometric transformations. Since it is not feasible to apply all 972 transformations to an input image, we randomly sample 10 of them, apply these to the input image, and form the resulting input tensor. The results of these experiments are presented in Table 11, from which it is evident that training with either invariance or equivariance improves over baseline training on both 1-shot and 5-shot tasks (2.5-3.7% improvement). Joint optimization for both invariance and equivariance provides an additional improvement of 1%. Even though the experiments with transformations sampled from the complete affine space do not improve over training with the 16-transformation set (described in Table 9), they demonstrate consistent improvement when invariance and equivariance are enforced simultaneously. This provides additional support for our claim that enforcing both invariance and equivariance is beneficial for learning good general representations for solving challenging FSL tasks.



Transformation  Quantized Values
Rotation        {0°, 90°, 180°, 270°}
Translation(X)  {-0.2, 0.0, 0.2}
Translation(Y)  {-0.2, 0.0, 0.2}
Scale           {0.67, 1.0, 1.33}
AspectRatio     {0.67, 1.0, 1.33}
Shear           {-20°, 0°, 20°}
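The sampling procedure described above can be sketched with the quantized values from Table 10: the full product space has 4 × 3 × 3 × 3 × 3 × 3 = 972 transformations, from which 10 are drawn at random per image. The helper name is ours, for illustration.

```python
# Sketch: enumerate the quantized affine-transformation space and draw the
# 10 random transformations applied to each input image (Appendix C).
import random
from itertools import product

ROTATIONS = (0, 90, 180, 270)
TRANSLATIONS_X = (-0.2, 0.0, 0.2)
TRANSLATIONS_Y = (-0.2, 0.0, 0.2)
SCALES = (0.67, 1.0, 1.33)
ASPECT_RATIOS = (0.67, 1.0, 1.33)
SHEARS = (-20, 0, 20)

ALL_AFFINE = list(product(ROTATIONS, TRANSLATIONS_X, TRANSLATIONS_Y,
                          SCALES, ASPECT_RATIOS, SHEARS))

def sample_transformations(k=10, seed=None):
    """Randomly sample k distinct affine transformations for one image."""
    rng = random.Random(seed)
    return rng.sample(ALL_AFFINE, k)

if __name__ == "__main__":
    print(len(ALL_AFFINE))                # 972
    print(len(sample_transformations()))  # 10
```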




Method  1-Shot  5-Shot
Baseline Training
Ours with only Invar (affine)  
Ours with only Equi (affine)  
Ours with Equi and Invar (affine)  
Ours with Equi and Invar (16 transformations)  

Appendix D Ablation Study for Coefficient of Inductive Loss
We conduct an ablation study to measure the effect of different values of the coefficient of the inductive loss (without multi-head distillation) on the CIFAR-FS [bertinetto2018metalearning] validation set; the results on 5-way 1-shot FSL tasks are presented in Fig. 4. From Fig. 4 it is evident that the proposed method is fairly robust to the value of this coefficient; however, the best performance is obtained when it is set to 1.0. Based on this ablation study, we use a loss coefficient of 1.0 for the inductive loss in all of our experiments.
Appendix E Ablation Study for Knowledge Distillation Temperature
To analyze the effect of the knowledge distillation temperature (for the Kullback-Leibler (KL) divergence losses), we conduct an ablation study on the validation set of the CIFAR-FS [bertinetto2018metalearning] dataset. From Fig. 5 we observe that the proposed method with the multi-head distillation objective is not very sensitive to the distillation temperature, achieving similar performance on the CIFAR-FS validation set for temperature values of 4.0 and 5.0. Based on this ablation study, and to be consistent with [tian2020rethink], we set the knowledge distillation temperature to 4.0 in all of our experiments.
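For reference, temperature-scaled distillation in the standard (Hinton-style) form can be sketched as below: teacher and student logits are softened with temperature T before computing the KL divergence, and the loss is scaled by T² to keep gradient magnitudes comparable across temperatures. The logits here are illustrative placeholders, and this numpy sketch omits the multi-head structure of our full objective.

```python
# Sketch: temperature-scaled KL divergence loss for knowledge distillation.
import numpy as np

def softmax(logits, T=1.0):
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(teacher_logits, student_logits, T=4.0):
    """Mean KL(teacher || student) on T-softened distributions, times T^2."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    kl = np.sum(p * (np.log(p) - np.log(q)), axis=-1)
    return (T ** 2) * kl.mean()

if __name__ == "__main__":
    t = np.array([[2.0, 1.0, 0.1]])
    s = np.array([[1.5, 1.2, 0.3]])
    print(kd_loss(t, t))         # 0.0 for identical logits
    print(kd_loss(t, s) >= 0.0)  # True: KL divergence is non-negative
```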
Appendix F Effect of Successive Distillation
In all of our experiments, we use only one stage of multi-head knowledge distillation. To further investigate the effect of knowledge distillation, we perform multiple stages of self-knowledge distillation on the CIFAR-FS [bertinetto2018metalearning] dataset. The results are presented in Fig. 6. Here, distillation stage 0 is the base learner trained with only the supervised baseline loss, the equivariant loss, and the invariant loss. From Fig. 6, we observe that FSL performance improves for the first two stages of distillation, after which it saturates. Moreover, the improvement from stage 1 to stage 2 is minimal (around 0.1%). Therefore, to make the proposed method more computationally efficient, we perform only one stage of distillation in all of our experiments.
Appendix G Additional Results on MetaDataset
In Table 12, we provide additional results on Meta-Dataset [Triantafillou2020MetaDataset:], which demonstrate that our proposed method provides an additional improvement (0.5%) with the proposed multi-head knowledge distillation objective. We also outperform the distilled version of the current state-of-the-art method RFS [tian2020rethink] by 0.6% in mean accuracy across all 10 datasets. Considering the challenging nature of Meta-Dataset, this improvement is significant.



Dataset  RFS-Distill  Ours  Ours-Distill
ILSVRC
Omniglot  
Aircraft  
Birds  
Textures  
Quick Draw  
Fungi  
VGG Flower  
Traffic Signs  
MSCOCO  
Mean Accuracy  

Appendix H Invariance and Equivariance for Supervised Classification
To demonstrate the complementary strengths of invariant and equivariant representations, we conduct fully supervised classification experiments on the benchmark CIFAR-100 dataset [CIFAR100]. For these experiments, we use the standard WideResNet-28-10 [BMVC2016_87] architecture as the backbone. For training, we use an SGD optimizer with an initial learning rate of 0.1, momentum of 0.9, and weight decay of 5e-4. For all experiments, training is performed for 200 epochs, where the learning rate is decayed by a factor of 5 at epochs 60, 120, and 160. We use a batch size of 128 and a dropout rate of 0.3 for all experiments. The training augmentations are the standard random crop and random horizontal flip. For enforcing invariance and equivariance, we use 12 transformations for computational efficiency (the description is available in Table 9). We do not perform knowledge distillation for these experiments. The results are presented in Table 13.
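The learning-rate schedule described above can be sketched as a small helper: start at 0.1 and divide by 5 at epochs 60, 120, and 160 over the 200-epoch run. The function name is ours, for illustration.

```python
# Sketch: step learning-rate schedule used for the CIFAR-100 experiments
# (initial LR 0.1, decayed by a factor of 5 at epochs 60, 120, 160).
def learning_rate(epoch, base_lr=0.1, decay=5.0, milestones=(60, 120, 160)):
    """Return the learning rate in effect at a given epoch."""
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr /= decay
    return lr

if __name__ == "__main__":
    for e in (0, 59, 60, 120, 160, 199):
        print(e, learning_rate(e))
```

In a PyTorch-style setup this corresponds to a multi-step scheduler with gamma = 0.2 at the same milestones.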
From Table 13, we notice that enforcing invariance provides little improvement (0.2%) over the supervised baseline. This is expected: the train and test data come from the same distribution and the same set of classes, so making the class boundaries more compact (for seen classes) provides little additional benefit. In contrast, for FSL, enforcing invariance over the baseline provides improvements of 2.62%, 2%, and 3.5% on miniImageNet [NIPS2016_6385], CIFAR-FS [bertinetto2018metalearning], and FC100 [oreshkin2018tadam], respectively (Section 4.2 of the main text). On the other hand, enforcing equivariance for supervised classification yields a larger improvement (1.8%), since it helps the model better learn the structure of the data. Even though enforcing equivariance provides a noticeable improvement for supervised classification, for FSL we obtain much bigger improvements of 4.07%, 4.87%, and 4.13% on miniImageNet, CIFAR-FS, and FC100, respectively (Section 4.2 of the main text). Finally, joint optimization for both invariance and equivariance achieves the best performance, providing a small but consistent improvement of 0.1% over enforcing equivariance alone, while yielding a much larger improvement on FSL tasks (see Section 4.2 of the main text). From these experiments, we conclude that, although enforcing both invariance and equivariance is beneficial for supervised classification, injecting these inductive biases is more crucial for FSL, where inductive inference is more challenging (inference on unseen/novel classes).



Method  Error Rate (%)
Supervised Baseline
Ours with only Invariance  
Ours with only Equivariance  
Ours with Equi and Invar (W/O KD)  
