Exploring Complementary Strengths of Invariant and Equivariant Representations for Few-Shot Learning

03/01/2021 · by Mamshad Nayeem Rizve, et al.

In many real-world problems, collecting a large number of labeled samples is infeasible. Few-shot learning (FSL) is the dominant approach to address this issue, where the objective is to quickly adapt to novel categories in the presence of a limited number of samples. FSL tasks have been predominantly solved by leveraging ideas from gradient-based meta-learning and metric learning approaches. However, recent works have demonstrated the significance of powerful feature representations with a simple embedding network that can outperform existing sophisticated FSL algorithms. In this work, we build on this insight and propose a novel training mechanism that simultaneously enforces equivariance and invariance to a general set of geometric transformations. Equivariance or invariance has been employed standalone in previous works; however, to the best of our knowledge, they have not been used jointly. Simultaneous optimization for both of these contrasting objectives allows the model to jointly learn features that are not only independent of the input transformation but also encode the structure of geometric transformations. These complementary sets of features help generalize well to novel classes with only a few data samples. We achieve additional improvements by incorporating a novel self-supervised distillation objective. Our extensive experimentation shows that even without knowledge distillation our proposed method can outperform current state-of-the-art FSL methods on five popular benchmark datasets.


1 Introduction

In recent years, deep learning methods have made great strides on several challenging problems [he2016deep, szegedy2016rethinking, he2017mask, carreira2017quo, chen2018encoder]. This success can be partially attributed to the availability of large-scale labeled datasets [imagenet_cvpr09, carreira2017quo, zhou2017places, lin2014microsoft]. However, acquiring large amounts of labeled data is infeasible in several real-world problems due to practical constraints such as the rarity of an event or the high cost of manual annotation. Few-shot learning (FSL) targets this problem by learning a model on a set of base classes and studying its adaptability to novel classes with only a few samples (typically 1-5) [pmlr-v70-finn17a, NIPS2016_6385, snell2017prototypical, sung2018learning]. Notably, this setting differs from transfer and self/semi-supervised learning, which assume the availability of pretrained models [sharif2014cnn, zamir2018taskonomy, kornblith2019better] or large amounts of unlabeled data [doersch2015unsupervised, chen2020simple, NIPS2019_8749_MixMatch].

Figure 1: Approach Overview: Shapes represent different transformations and colors represent different classes. While the invariant features provide better discrimination, the equivariant features help us learn the internal structure of the data manifold. These complementary representations help us generalize better to new tasks with only a few training samples. By jointly leveraging the strengths of equivariant and invariant features, our method achieves a significant improvement over the baseline (bottom row).

FSL has been predominantly solved using ideas from meta-learning. The two most dominant approaches are optimization-based meta-learning [pmlr-v70-finn17a, jamal2019task, rusu2018metalearning] and metric-learning based methods [snell2017prototypical, sung2018learning, pmlr-v97-allen19b]. Both sets of approaches attempt to train a base learner that can be quickly adapted in the presence of a few novel-class examples. However, it has recently been shown in [Raghu2020Rapid] that the quick adaptation of the base learner crucially depends on feature reuse. Other recent works [tian2020rethink, Dhillon2020A, chen2019closerfewshot] have also shown that a baseline feature extractor trained on the entire meta-train set can achieve performance comparable to state-of-the-art meta-learning based methods. This raises an interesting question: How much further can FSL performance be pushed by simply improving the base feature extractor?

To answer this question, we first take a look at the inductive biases in machine learning (ML) algorithms. The optimization of all ML algorithms takes advantage of different inductive biases for hypothesis selection, since the solutions are never unique. The generalization of these algorithms often relies on the effective design of inductive biases, since they encode our a priori preference for a particular set of solutions. For instance, regularization methods like $\ell_1$/$\ell_2$-penalties [tibshirani1996regression], dropout [srivastava2014dropout], or early stopping [prechelt1998early] implicitly impose Occam's razor on the optimization process by selecting simpler solutions. Likewise, convolutional neural networks (CNNs) by design impose translation invariance [battaglia2018relational], which makes the internal embeddings translation equivariant. Inspired by this, several methods [cohen2016group, finzi2020generalizing, dieleman2016exploiting] have attempted to generalize CNNs by imposing equivariance to different geometric transformations so that the internal structure of data can be modeled more efficiently. On the other hand, methods like [laptev2016ti] aim to be robust against nuisance variations by learning transformation-invariant features. However, such inductive biases do not provide optimal generalization on FSL tasks, and the design of effective inductive biases for FSL remains relatively unexplored.

In this paper, we propose a novel feature learning approach by designing an effective set of inductive biases. We observe that the features required to achieve invariance against input transformations provide better discrimination but can hurt generalization. Similarly, features that focus on transformation discrimination are not optimal for class discrimination, but they learn equivariant properties that help capture the data structure, leading to better transferability. Therefore, we propose to combine the complementary strengths of both feature types through a multi-task objective that simultaneously seeks to retain both invariant and equivariant features. We argue that learning such generic features encourages the base feature extractor to be more general. We validate this claim through extensive experimentation on multiple benchmark datasets. We also conduct thorough ablation studies to demonstrate that enforcing both equivariance and invariance outperforms enforcing either of these objectives alone (see Fig. 1).

Our main contributions are:


  • We enforce complementary equivariance and invariance to a general set of geometric transformations to model the underlying structure of the data while remaining discriminative, thereby improving generalization for FSL.

  • Instead of extensive architectural changes, we propose a simple alternative by defining self-supervised tasks as auxiliary supervision. For equivariance, we introduce a transformation discrimination task, while an instance discrimination task is developed to learn transformation invariant features.

  • We demonstrate additional gains with a cross-task knowledge distillation objective that retains the equivariance and invariance properties.

2 Related Works

Few-shot Learning: FSL approaches generally belong to the meta-learning family, which either learn a generalizable metric space [snell2017prototypical, koch2015siamese, vinyals2016matching, oreshkin2018tadam] or apply gradient-based updates to obtain a good initialization. In the first class of methods, Siamese networks related a pair of images [koch2015siamese], matching networks applied an LSTM-based context encoder to match query and support set images [vinyals2016matching], and prototypical networks used the distance between the query and the prototype embedding for class assignment [snell2017prototypical]. A task-dependent metric scaling approach to improve FSL was introduced in [oreshkin2018tadam]. The second category uses gradient-based meta-learning methods, which include using a sequence model (e.g., an LSTM) to learn generalizable optimization rules [ravi2017optimization], Model-Agnostic Meta-Learning (MAML) to find a good initialization that can be quickly adapted to new tasks with minimal supervision [pmlr-v70-finn17a], and Latent Embedding Optimization (LEO), which applied MAML in a low-dimensional space from which high-dimensional parameters can be generated [rusu2018metalearning]. A few recent efforts, e.g., ProtoMAML [Triantafillou2020Meta-Dataset:], combined the complementary strengths of metric-learning and gradient-based meta-learning methods.

Inductive Biases in CNNs: Inductive biases reflect our prior knowledge regarding a particular problem. State-of-the-art CNNs are built on such design choices, which range from the convolutional operator (e.g., weight sharing and translational equivariance) [lecun1995convolutional] and the pooling operator (e.g., local neighbourhood relevance) [Cohen2017InductiveBO], to regularization mechanisms (e.g., sparsity with an $\ell_1$ regularizer) [khan2018guide] and loss functions (e.g., max-margin boundaries) [hayat2019gaussian]. Similarly, recurrent architectures and attention mechanisms are biased towards preserving contextual information and being invariant to time translation [battaglia2018relational]. A number of approaches have been designed to achieve invariance to nuisances such as natural perturbations [hendrycks2019benchmarking, tramer2019adversarial], viewpoint changes [milford2015sequence], and image transformations [cubuk2018autoaugment, buslaev2020albumentations]. On the other hand, equivariant representations have also been investigated to retain knowledge regarding group actions [cohen2016group, qi2020learning, sabour2017dynamic, lenssen2018group], thereby maintaining meaningful structure in the representations. In this work, we advocate that the representations required to simultaneously achieve invariance and equivariance can be useful for generalization to new tasks with limited data.

Self-supervised Learning for FSL:

Our self-supervised loss is inspired by the recent progress in self-supervised learning (SSL), where proxy tasks are defined to learn transferable representations without adding any manual annotations [rajasegaran2020self]. The pretext tasks include colorization [larsson2016learning, zhang2016colorful], inpainting [pathak2016context], relative patch location [doersch2015unsupervised, noroozi2016unsupervised], and predicting the amount of rotation applied [gidaris2018unsupervised]. Recently, the potential of SSL for FSL was explored in [gidaris2019boosting, su2020does]. In [gidaris2019boosting], a parallel branch with a rotation prediction task was added to help learn generalizable features. Su et al. [su2020does] also used rotation and permutation of patches as auxiliary tasks and concluded that SSL is more effective in low-shot regimes and under significant domain shifts. A recent line of work employed SimCLR-style [chen2020simple] contrastive learning with augmented pairs to learn improved representations in either unsupervised pretraining [medina2020self] or episodic training [doersch2020crosstransformers] for FSL.

In contrast to the existing SSL approaches for FSL, we propose to jointly optimize for a complementary pair of pretext tasks that lead to better generalization. Our novel distillation objective acquires knowledge from the classification as well as the proxy task heads and demonstrates further performance improvements. We present our approach next.

3 Our Approach

Figure 2: Network Architecture during Training: A series of transformed inputs (obtained by applying a set of geometric transformations) is provided to a shared feature extractor. The resulting embedding is forwarded to three parallel heads that focus on learning equivariant features, discriminative class boundaries, and invariant features, respectively. The resulting output representations are distilled from an old copy of the model (teacher model on the right) across multiple heads to further improve the encoded representations. Notably, a dedicated memory bank of negative samples helps stabilize our invariant contrastive learning.

We first describe the problem setting and the baseline training approach and then present our proposed approach.

3.1 Problem Formulation

Few-shot learning (FSL) operates in two phases: first, a model is trained on a set of base classes, and then at inference a new set of few-shot classes is received. We define the base training set as $\mathcal{D}_{base} = \{(\mathbf{x}_i, \mathbf{y}_i)\}_{i=1}^{N_b}$, where $\mathbf{x}_i$ is an image and the one-hot encoded label $\mathbf{y}_i$ can belong to one of a total of $C_b$ base classes. At inference, a dataset $\mathcal{D}_{novel}$ of few-shot classes is presented for learning, such that the label belongs to one of $N$ novel classes, each with a total of $K$ examples ($K$ typically ranges between 1-5). The evaluation setting for few-shot classes is denoted as $N$-way, $K$-shot. Importantly, the base and few-shot classes belong to totally disjoint sets.

For solving the FSL task, most meta-learning methods [pmlr-v70-finn17a, snell2017prototypical, NIPS2016_6385] have leveraged an episodic training scheme. An episode consists of a small train set and a small test set, whose examples are sampled from the same distribution, i.e., from the same subset of meta-training classes. Meta-learning methods try to optimize the parameters of the base learner by solving a collection of these episodes. The main motivation is that the evaluation conditions should be emulated in the base training stage. However, following recent works [tian2020rethink, Dhillon2020A, chen2019closerfewshot], we do not use an episodic training scheme, which allows us to train a single generalizable model that can be efficiently used in any-way, any-shot settings without retraining. Specifically, we train our base learner on the whole base training set in a supervised manner.

Let us assume our base learner for the FSL task is a neural network $f_\Theta$, parameterized by $\Theta$. The role of this base learner is to extract good feature embeddings that generalize to novel classes. The base learner projects an input image $\mathbf{x}$ into the embedding space $\mathbb{R}^{d}$, such that $f_\Theta(\mathbf{x}) = \mathbf{z}$. Now, to optimize the parameters of the base learner, we need a classifier to project the extracted embeddings into the label space. To this end, we introduce a classifier function $c_\Phi$ with parameters $\Phi$ that projects the embeddings into the label space, i.e., $c_\Phi: \mathbb{R}^{d} \to \mathbb{R}^{C_b}$, such that $c_\Phi(\mathbf{z}) = \hat{\mathbf{y}}$.

We jointly optimize the parameters of both $f_\Theta$ and $c_\Phi$ by minimizing the cross-entropy loss on the whole base training set $\mathcal{D}_{base}$. The classification loss is given by

$\mathcal{L}_{ce} = \mathbb{E}_{(\mathbf{x}, \mathbf{y}) \in \mathcal{D}_{base}}\left[ -\mathbf{y}^{\top} \log c_\Phi(f_\Theta(\mathbf{x})) \right].$

To regularize the parameters of both sub-networks, we add a regularization term. Hence, the learning objective for our baseline training algorithm becomes:

$\mathcal{L}_{base} = \mathcal{L}_{ce} + \mathcal{R}(\Theta, \Phi)$   (1)

Here, $\mathcal{R}(\Theta, \Phi)$ is an $\ell_2$ regularization term for the parameters $\Theta$ and $\Phi$. Next, we present our inductive objectives.
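To make the baseline concrete, the following is a minimal PyTorch-style sketch of this supervised pretraining step (Eq. 1). The module and loader names (`encoder`, `classifier`, `base_loader`) are illustrative assumptions, not the authors' released code; the $\ell_2$ term is realized through the optimizer's weight decay.

```python
import torch
import torch.nn.functional as F

def train_baseline_epoch(encoder, classifier, base_loader, optimizer, device="cuda"):
    """One epoch of the baseline objective (Eq. 1): cross-entropy on the base
    classes; the l2 regularizer R(Theta, Phi) is handled by weight decay."""
    encoder.train()
    classifier.train()
    for images, labels in base_loader:          # (B, 3, H, W), (B,)
        images, labels = images.to(device), labels.to(device)
        z = encoder(images)                     # embeddings, shape (B, d)
        logits = classifier(z)                  # shape (B, C_base)
        loss = F.cross_entropy(logits, labels)  # L_ce
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```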

3.2 Injecting Inductive Biases through SSL

We propose to enforce equivariance and invariance to a general set of geometric transformations by simply performing self-supervised learning (SSL). Self-supervision is particularly useful for learning general features without accessing semantic labels. For representation learning, self-supervised methods generally aim either to achieve equivariance to some input transformations or to discriminate instances by making the representations invariant. To the best of our knowledge, simultaneous equivariance and invariance to a general set of geometric transformations has not been explored in the self-supervised literature; ours is the first work to do so.

The transformation set $\mathcal{T} = \{t_1, t_2, \dots, t_M\}$ is obtained from a family of geometric transformations $\mathbb{T}$, with $\mathcal{T} \subset \mathbb{T}$. Here, $\mathbb{T}$ can be interpreted as a family of geometric transformations such as Euclidean, similarity, affine, and projective transformations. All of these geometric transformations can be represented by a $3 \times 3$ matrix with varying degrees of freedom. However, enforcing equivariance and invariance over a continuous space of geometric transformations $\mathbb{T}$ is difficult and may even lead to suboptimal solutions. To overcome this issue, in this work, we quantize the complete space of affine transformations. We approximate $\mathbb{T}$ by dividing it into a discrete set of $M$ transformations. Here, $M$ can be selected based on the nature of the data and the computation budget.

For training, we generate $M$ transformed copies of an input image by applying all $M$ transformations. We then combine all of these transformed images into a single tensor, $\tilde{\mathbf{x}} = [\mathbf{x}^{t_1}, \dots, \mathbf{x}^{t_M}]$. Here, $\mathbf{x}^{t_j}$ is the input image transformed through the $j$-th transformation $t_j$ (the image subscript is dropped in the subsequent discussion for clarity). We send this composite input to the network and optimize for both equivariance and invariance. The training is performed in a multi-task fashion. In addition to the classification head, which is needed for the baseline supervised training, two other heads are added on top of the base learner, as shown in Figure 2. One of these heads is used for enforcing equivariance, and the other is used for enforcing invariance. This multi-task training scheme ensures that the base learner retains both transformation-equivariant and transformation-invariant features in the output embedding. We explain each component of our inductive loss below.
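As an illustration, the sketch below builds such a composite input from a small quantized transformation set using torchvision. The particular transformation list and helper names (`make_transforms`, `build_composite`) are assumptions for exposition, not the exact pipeline used by the authors.

```python
import torch
import torchvision.transforms.functional as TF

def make_transforms():
    """A small quantized transformation set: rotations combined with
    aspect-ratio changes (a subset of the similarity/affine family)."""
    transforms = []
    for angle in (0, 90, 180, 270):
        for ar in (0.67, 1.0, 1.33):
            transforms.append((angle, ar))
    return transforms                                   # M = 12 in this sketch

def apply_transform(img, angle, ar):
    """Rotate, then squeeze/stretch the width to change the aspect ratio
    and resize back to the original resolution."""
    _, h, w = img.shape
    out = TF.rotate(img, angle)
    out = TF.resize(out, [h, max(1, int(round(w * ar)))], antialias=True)
    return TF.resize(out, [h, w], antialias=True)

def build_composite(img, transforms):
    """Stack all M transformed copies of one image into a tensor (M, C, H, W)."""
    return torch.stack([apply_transform(img, a, ar) for a, ar in transforms])

# usage: x_tilde = build_composite(image, make_transforms())
```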

3.2.1 Enforcing Equivariance

As discussed above, equivariant features help us encode the inherent structure of the data, which improves the generalization of features to new tasks. To enforce equivariance to the set $\mathcal{T}$ of $M$ quantized transformations, we introduce an MLP head $g_{\Psi_e}$ with parameters $\Psi_e$. The role of $g_{\Psi_e}$ is to project the output embeddings from the base learner into an equivariant space, i.e., $g_{\Psi_e}(f_\Theta(\mathbf{x}^{t_j})) = \mathbf{p}^{t_j}$, where $\mathbf{p}^{t_j} \in \mathbb{R}^{M}$.

In order to train the network, we create proxy labels without any manual supervision. For a specific transformation $t_j$, an $M$-dimensional one-hot encoded vector $\mathbf{y}^{t_j}$ (such that $y^{t_j}_{j} = 1$) is used to represent the label for $\mathbf{x}^{t_j}$. Once proxy labels are assigned, training is performed in a supervised manner with the cross-entropy loss, as follows:

$\mathcal{L}_{e} = \mathbb{E}_{\mathbf{x} \in \mathcal{D}_{base}}\left[ -\frac{1}{M} \sum_{j=1}^{M} \mathbf{y}^{t_j \top} \log\, \mathrm{softmax}\!\left( g_{\Psi_e}\!\left(f_\Theta(\mathbf{x}^{t_j})\right) \right) \right]$   (2)

This supervised training with proxy labels in the equivariant space ensures that the output embedding retains transformation-equivariant features.
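A minimal sketch of this transformation-discrimination objective (Eq. 2) is given below, assuming a composite tensor built as in the earlier sketch; the head architecture, hidden size, and names are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EquivariantHead(nn.Module):
    """Single-hidden-layer MLP mapping an embedding to M transformation logits."""
    def __init__(self, dim, num_transforms, hidden=512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(inplace=True),
                                 nn.Linear(hidden, num_transforms))

    def forward(self, z):
        return self.net(z)

def equivariance_loss(encoder, equiv_head, x_tilde):
    """x_tilde: (B, M, C, H, W) composite inputs. The proxy label of each
    transformed copy is simply its transformation index j."""
    B, M = x_tilde.shape[:2]
    flat = x_tilde.flatten(0, 1)                            # (B*M, C, H, W)
    logits = equiv_head(encoder(flat))                      # (B*M, M)
    proxy = torch.arange(M, device=flat.device).repeat(B)   # (B*M,) labels 0..M-1
    return F.cross_entropy(logits, proxy)                   # L_e (Eq. 2)
```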

3.2.2 Enforcing Invariance

While equivariant representations are important for encoding the structure in the data, they may not be optimal for class discrimination. This is because the transformations we consider are nuisance variations that do not change the image class; therefore, a good feature extractor should also encode representations that are independent of these input variations. To enforce invariance to the set $\mathcal{T}$ of $M$ quantized transformations, we introduce another MLP head $g_{\Psi_i}$ with parameters $\Psi_i$. The role of $g_{\Psi_i}$ is to project the output embeddings from the base learner into an invariant space, i.e., $g_{\Psi_i}(f_\Theta(\mathbf{x}^{t_j})) = \mathbf{v}^{t_j}$, where $\mathbf{v}^{t_j} \in \mathbb{R}^{d_i}$ and $d_i$ is the dimension of the invariant embedding.

To optimize for invariance, we leverage a contrastive loss [hadsell2006dimensionality] for instance discrimination. We enforce invariance by maximizing the similarity between an embedding $\mathbf{v}^{t_j}$ corresponding to a transformed image (after undergoing transformation $t_j$) and the reference embedding $\mathbf{v}$ (the embedding of the original image without applying any transformation). Importantly, we note that selecting negatives within a batch is not sufficient to obtain discriminant representations [wu2018unsupervised, misra2020self]. We employ a memory bank in our contrastive loss to sample more negative samples without arbitrarily increasing the batch size. Further, the memory bank allows a stable convergence behavior [misra2020self]. Our learning objective is as follows:

$\mathcal{L}_{i} = -\mathbb{E}\left[ \frac{1}{M} \sum_{j=1}^{M} \log h\!\left(\mathbf{v}^{t_j}, \bar{\mathbf{v}}\right) + \log h\!\left(\mathbf{v}, \bar{\mathbf{v}}\right) \right]$   (3)

where $j$ denotes the transformation index, $\bar{\mathbf{v}}$ represents a previous copy of the reference embedding held in the memory bank, and the function $h(\cdot,\cdot)$ is defined as

$h(\mathbf{v}_1, \mathbf{v}_2) = \dfrac{\exp\left(s(\mathbf{v}_1, \mathbf{v}_2)/\tau\right)}{\exp\left(s(\mathbf{v}_1, \mathbf{v}_2)/\tau\right) + \sum_{\mathbf{v}^{-} \in \mathcal{N}} \exp\left(s(\mathbf{v}_1, \mathbf{v}^{-})/\tau\right)}.$

Here, $s(\cdot,\cdot)$ is a similarity function, $\tau$ is the temperature, and $\mathcal{N}$ is the set of negative samples drawn from the memory bank for a particular minibatch. Note that we also maximize the similarity between the reference embedding and its past representation, which helps stabilize the learning.
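Below is a simplified sketch of this memory-bank contrastive objective (Eq. 3). All names (`MemoryBank`, `nce_term`, `invariance_loss`) are illustrative assumptions; a full PIRL-style implementation would additionally exclude the current samples from the negatives and handle momentum and multi-GPU details.

```python
import torch
import torch.nn.functional as F

class MemoryBank:
    """One running (L2-normalized) invariant embedding per training image."""
    def __init__(self, num_images, dim=64, momentum=0.5, device="cuda"):
        self.bank = F.normalize(torch.randn(num_images, dim, device=device), dim=1)
        self.momentum = momentum

    @torch.no_grad()
    def update(self, idx, v):
        new = self.momentum * self.bank[idx] + (1 - self.momentum) * v
        self.bank[idx] = F.normalize(new, dim=1)

    def sample_negatives(self, num):
        # Uniform sampling for simplicity; a real implementation would exclude
        # the indices of the current mini-batch.
        perm = torch.randperm(self.bank.shape[0], device=self.bank.device)[:num]
        return self.bank[perm]                                    # (num, dim)

def nce_term(query, positive, negatives, tau=1.0):
    """-log h(query, positive): softmax over one positive and |N| negatives."""
    pos = (query * positive).sum(dim=1, keepdim=True) / tau       # (B, 1)
    neg = query @ negatives.t() / tau                             # (B, |N|)
    logits = torch.cat([pos, neg], dim=1)
    target = torch.zeros(query.shape[0], dtype=torch.long, device=query.device)
    return F.cross_entropy(logits, target)

def invariance_loss(v_ref, v_trans, bank_ref, negatives, tau=1.0):
    """v_ref: (B, d) reference embeddings; v_trans: (B, M, d) embeddings of the
    transformed copies; bank_ref: (B, d) memory-bank copies of the references."""
    loss = nce_term(v_ref, bank_ref, negatives, tau)              # reference vs. its past copy
    for j in range(v_trans.shape[1]):                             # transformed copies vs. past reference
        loss = loss + nce_term(v_trans[:, j], bank_ref, negatives, tau)
    return loss / (v_trans.shape[1] + 1)
```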

3.2.3 Multi-head Distillation

Once the invariant and equivariant representations are learned by our model, we use self-distillation to train a new model using the outputs of the previous model as anchor points (Fig. 2). Note that in typical knowledge distillation [44873], information is transferred from a larger model (teacher) to a smaller one (student) by matching their softened outputs. In contrast, self-distillation [DBLP:conf/icml/FurlanelloLTIA18] matches the outputs of identical models, where the smooth predictions encode inter-label dependencies, thereby helping the model learn better representations.

In our case, a simple knowledge distillation that pairs only the classification logits [tian2020rethink] would not ensure the transfer of the invariant and equivariant representations learned by the previous model version. Therefore, we extend the idea of logit-based knowledge distillation and apply it to our invariant and equivariant embeddings. Specifically, in parallel to minimizing the Kullback-Leibler (KL) divergence between the soft outputs of the supervised classifier heads $c_\Phi$, we also minimize the KL divergence between the outputs of the equivariant heads $g_{\Psi_e}$. Since the output of our invariant head $g_{\Psi_i}$ is not a probability distribution, we minimize an $\ell_2$ loss for distilling the knowledge at this head. The overall learning objective for knowledge distillation is as follows:

$\mathcal{L}_{kd} = \mathrm{KL}\!\left(\mathbf{p}_{cls}^{T} \,\|\, \mathbf{p}_{cls}^{S}\right) + \mathrm{KL}\!\left(\mathbf{p}_{e}^{T} \,\|\, \mathbf{p}_{e}^{S}\right) + \left\| \mathbf{v}^{T} - \mathbf{v}^{S} \right\|_{2}^{2}$   (4)

Here, $\mathbf{p}_{cls}$ and $\mathbf{p}_{e}$ denote the temperature-softened outputs of the classification and equivariant heads, $\mathbf{v}$ denotes the output of the invariant head, and the superscripts $T$ and $S$ denote the teacher and student networks for distillation, respectively.
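The following sketch shows one way to implement this multi-head distillation objective (Eq. 4) in PyTorch; the dictionary layout, temperature handling, and head names are assumptions consistent with standard logit distillation rather than the authors' exact code.

```python
import torch
import torch.nn.functional as F

def multi_head_distillation_loss(student_out, teacher_out, temperature=4.0):
    """student_out / teacher_out: dicts with
       'cls'   -> classification logits   (B, C_base)
       'equiv' -> transformation logits   (B*M, M)
       'inv'   -> invariant embeddings    (B*(M+1), d_i)
    The teacher is a frozen copy of the previously trained model."""
    t = temperature

    def kl(student_logits, teacher_logits):
        # KL(teacher || student) on temperature-softened distributions,
        # scaled by t^2 as in standard knowledge distillation.
        log_p_s = F.log_softmax(student_logits / t, dim=1)
        p_t = F.softmax(teacher_logits.detach() / t, dim=1)
        return F.kl_div(log_p_s, p_t, reduction="batchmean") * (t * t)

    loss = kl(student_out["cls"], teacher_out["cls"])              # classifier head
    loss = loss + kl(student_out["equiv"], teacher_out["equiv"])   # equivariant head
    loss = loss + F.mse_loss(student_out["inv"],                   # invariant head: l2
                             teacher_out["inv"].detach())
    return loss
```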

3.2.4 Overall Objective

Finally, we obtain the resultant loss for injecting the desired inductive biases by combining the equivariant ($\mathcal{L}_{e}$), invariant ($\mathcal{L}_{i}$), and multi-head distillation ($\mathcal{L}_{kd}$) losses: $\mathcal{L}_{ind} = \mathcal{L}_{e} + \mathcal{L}_{i} + \mathcal{L}_{kd}$.

The overall loss is simply a combination of the inductive and baseline objectives,

$\mathcal{L} = \mathcal{L}_{base} + \gamma\,\mathcal{L}_{ind}$   (5)

where $\gamma$ is the coefficient of the inductive loss (set to 1.0 in all of our experiments; see the supplementary materials).

3.3 Few-Shot Evaluation

For evaluation, we test our base learner by sampling FSL tasks from a held-out test set comprising images from novel classes. Each FSL task contains a support set $\mathcal{S}$ and a corresponding query set $\mathcal{Q}$; both contain images from the same subset of test classes. Using $f_\Theta$, we obtain embeddings for the images of both $\mathcal{S}$ and $\mathcal{Q}$. Following [tian2020rethink], we train a simple logistic regression classifier on the image embeddings and the corresponding labels from $\mathcal{S}$. We then use this linear classifier to infer the labels of the query embeddings.
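A compact sketch of this evaluation protocol is shown below, using scikit-learn's logistic regression on frozen embeddings. The function names and hyperparameters are assumptions, and details such as L2-normalizing the embeddings follow common practice rather than a stated choice in the paper.

```python
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression

@torch.no_grad()
def embed(encoder, images, device="cuda"):
    """Frozen-encoder embeddings, L2-normalized (a common choice)."""
    z = encoder(images.to(device)).cpu().numpy()
    return z / np.linalg.norm(z, axis=1, keepdims=True)

def evaluate_episode(encoder, support_x, support_y, query_x, query_y):
    """Fit a linear classifier on the support embeddings, report query accuracy."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(embed(encoder, support_x), support_y.numpy())
    preds = clf.predict(embed(encoder, query_x))
    return (preds == query_y.numpy()).mean()
```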

4 Experimental Evaluation

Datasets:

We evaluate our method on five popular benchmark FSL datasets. Two of these datasets are subsets of the CIFAR-100 dataset: CIFAR-FS [bertinetto2018metalearning] and FC100 [oreshkin2018tadam]. Another two are derivatives of the ImageNet [imagenet_cvpr09] dataset: miniImageNet [NIPS2016_6385] and tieredImageNet [ren2018meta]. The CIFAR-FS dataset is constructed by randomly splitting the 100 classes of CIFAR-100 into 64, 16, and 20 classes for training, validation, and testing. The FC100 dataset makes the FSL task more challenging by making the splits more diverse; its train, validation, and test splits contain 60, 20, and 20 classes, respectively. Following [Ravi2017OptimizationAA], we use 64, 16, and 20 classes of the miniImageNet dataset for training, validation, and testing. The tieredImageNet dataset contains 608 ImageNet classes grouped into 34 high-level categories; we use 20/351, 6/97, and 8/160 categories/classes for training, validation, and testing. We also evaluate our method on the recently proposed Meta-Dataset [Triantafillou2020Meta-Dataset:], which contains 10 diverse datasets to make the FSL task more challenging and closer to realistic classification problems.



Methods Backbone 1-Shot 5-Shot


MAML[pmlr-v70-finn17a] 32-32-32-32
Proto-Net[snell2017prototypical] 64-64-64-64
Relation Net[sung2018learning] 64-96-128-256
R2D2[bertinetto2018metalearning] 96-192-384-512
Shot-Free[ravichandran2019few] ResNet-12
TEWAM[qiao2019transductive] ResNet-12
Proto-Net[snell2017prototypical] ResNet-12
MetaOptNet[lee2019meta] ResNet-12
Boosting[gidaris2019boosting] WRN-28-10
Fine-tuning[Dhillon2020A] WRN-28-10
DSN-MR[simon2020adaptive] ResNet-12
MABAS[10.1007/978-3-030-58452-8_35] ResNet-12
RFS-Simple[tian2020rethink] ResNet-12
RFS-Distill[tian2020rethink] ResNet-12

Ours ResNet-12
Ours-Distill ResNet-12


Table 1: Average 5-way few-shot classification accuracy with 95% confidence intervals on the CIFAR-FS dataset; trained on both train and validation sets. The top two results are shown in red and blue.


Methods Backbone 1-Shot 5-Shot


Proto-Net[snell2017prototypical] 64-64-64-64
Proto-Net[snell2017prototypical] ResNet-12
TADAM[oreshkin2018tadam] ResNet-12
MetaOptNet[lee2019meta] ResNet-12
MTL[sun2019meta] ResNet-12
Fine-tuning[Dhillon2020A] WRN-28-10
MABAS[10.1007/978-3-030-58452-8_35] ResNet-12
RFS-Simple[tian2020rethink] ResNet-12
RFS-Distill[tian2020rethink] ResNet-12
Ours ResNet-12
Ours-Distill ResNet-12


Table 2: Average 5-way few-shot classification accuracy with 95% confidence intervals on FC100 dataset; trained on both train and validation sets. The top two results are shown in red and blue.


Methods Backbone 1-Shot 5-Shot


MAML[pmlr-v70-finn17a] 32-32-32-32
Matching Net [NIPS2016_6385] 64-64-64-64
Proto-Net[snell2017prototypical] 64-64-64-64
Relation Net[sung2018learning] 64-96-128-256
R2D2[bertinetto2018metalearning] 96-192-384-512
SNAIL[mishra2018a] ResNet-12
AdaResNet[pmlr-v80-munkhdalai18a] ResNet-12
TADAM[oreshkin2018tadam] ResNet-12
Shot-Free[ravichandran2019few] ResNet-12
TEWAM[qiao2019transductive] ResNet-12
MTL[sun2019meta] ResNet-12
MetaOptNet[lee2019meta] ResNet-12
Boosting[gidaris2019boosting] WRN-28-10
Fine-tuning[Dhillon2020A] WRN-28-10
LEO-trainval[rusu2018metalearning] WRN-28-10
Deep DTN[Chen2020DiversityTN] ResNet-12
AFHN[li2020adversarial] ResNet-18
AWGIM[guo2020attentive] WRN-28-10
DSN-MR[simon2020adaptive] ResNet-12
MABAS[10.1007/978-3-030-58452-8_35] ResNet-12
RFS-Simple[tian2020rethink] ResNet-12
RFS-Distill[tian2020rethink] ResNet-12
Ours ResNet-12
Ours-Distill ResNet-12


Table 3: Average 5-way few-shot classification accuracy with 95% confidence intervals on miniImageNet dataset; trained on both train and validation sets. The top two results are shown in red and blue.

Implementation Details: Following [tian2020rethink, mishra2018a, oreshkin2018tadam, lee2019meta], we use a ResNet-12 network as our base learner for experiments on the CIFAR-FS, FC100, miniImageNet, and tieredImageNet datasets. Following [tian2020rethink, lee2019meta], we also apply the DropBlock [ghiasi2018dropblock] regularizer to our ResNet-12 base learner. For Meta-Dataset experiments, we use a ResNet-18 [he2016deep] network as our base learner to be consistent with [tian2020rethink]. We instantiate both of our equivariant and invariant embedding heads ($g_{\Psi_e}$, $g_{\Psi_i}$) with an MLP consisting of a single hidden layer. The classifier $c_\Phi$ is instantiated with a single linear layer.

We use the SGD optimizer with a momentum of 0.9 in all experiments. For the CIFAR-FS, FC100, miniImageNet, and tieredImageNet datasets, we set the initial learning rate to 0.05 and apply weight decay. For experiments on the CIFAR-FS, FC100, and miniImageNet datasets, we train for 65 epochs; the learning rate is decayed by a factor of 10 after the first 60 epochs. We train for 60 epochs on the tieredImageNet dataset; the learning rate is decayed by a factor of 10 three times after the first 30 epochs. For Meta-Dataset experiments, we set the initial learning rate to 0.1 and apply weight decay. We train our method for 90 epochs and decay the learning rate by a factor of 10 every 30 epochs. We use a batch size of 64 in all of our experiments except on Meta-Dataset, where the batch size is set to 256 following [tian2020rethink]. For Meta-Dataset experiments, we use standard data augmentation, which includes random horizontal flips and random resized crops. For all other datasets, we use random crops, color jittering, and random horizontal flips for data augmentation, following [tian2020rethink, lee2019meta]. Consistent with [tian2020rethink], we use a temperature of 4.0 for our knowledge distillation experiments. For all datasets, we perform one stage of distillation, except for the ILSVRC training on Meta-Dataset, where we do not use distillation. We sample 600 FSL tasks to report our scores.
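For reference, here is a minimal sketch of this optimization setup for the CIFAR-FS/FC100/miniImageNet schedule. The weight-decay value is not recoverable from the text above and is therefore marked explicitly as an assumption.

```python
import torch

def make_optimizer_and_scheduler(parameters):
    """SGD with momentum 0.9, initial LR 0.05, one 10x decay after epoch 60
    (65 epochs total). The weight-decay value is an assumed placeholder."""
    optimizer = torch.optim.SGD(parameters, lr=0.05, momentum=0.9,
                                weight_decay=5e-4)  # assumption: value not given in the text
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[60], gamma=0.1)
    return optimizer, scheduler
```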

For our geometric transformations, we sample from the complete space of similarity transformations and use four rotations: {0°, 90°, 180°, 270°}, two scalings: {0.67, 1.0}, and three aspect-ratio transformations: {0.67, 1.0, 1.33}. These geometric transformations can be applied in any combination. For all of our experiments, we set the total number of applied transformations $M$ to 16. Additional details and experiments with more geometric transformations are included in the supplementary materials. For the contrastive loss, we use a memory bank that stores 64-dimensional embeddings of instances; we sample 6400 negative samples from the memory bank for each mini-batch and set the temperature $\tau$ to 1.0.

4.1 Results



Methods Backbone 1-Shot 5-Shot


MAML[pmlr-v70-finn17a] 32-32-32-32
Proto-Net[snell2017prototypical] 64-64-64-64
Relation Net[sung2018learning] 64-96-128-256
Shot-Free[ravichandran2019few] ResNet-12
MetaOptNet[lee2019meta] ResNet-12
Boosting[gidaris2019boosting] WRN-28-10
Fine-tuning[Dhillon2020A] WRN-28-10
LEO-trainval[rusu2018metalearning] WRN-28-10
AWGIM[guo2020attentive] WRN-28-10
DSN-MR[simon2020adaptive] ResNet-12
RFS-Simple[tian2020rethink] ResNet-12
RFS-Distill[tian2020rethink] ResNet-12

Ours ResNet-12
Ours-Distill ResNet-12


Table 4: Average 5-way few-shot classification accuracy with 95% confidence intervals on tieredImageNet dataset; trained on both train and validation sets. The top two results are shown in red and blue.

We present our results on five popular benchmark FSL datasets in Tables 1-5, which demonstrate that even without multi-head distillation our proposed method consistently outperforms current state-of-the-art (SOTA) FSL methods on both 5-way 1-shot and 5-way 5-shot tasks. By virtue of our novel representation learning approach, which retains both transformation-invariant and equivariant features in the learned embeddings, our method improves over the baseline RFS-Simple [tian2020rethink] across all datasets by 2-5% for both 1-shot and 5-shot tasks. Specifically, our method outperforms the current best results on the CIFAR-FS dataset (Table 1) by 1.3% on the 1-shot task, while for the 5-shot task it improves the score by 2.8%. Moreover, unlike [Dhillon2020A], which achieves the current best result on the CIFAR-FS 1-shot task, we do not perform any transductive fine-tuning. For the FC100 dataset (Table 2), we observe an even bigger improvement: 2.7% and 4.4% for 1-shot and 5-shot, respectively. We see similar trends on miniImageNet and tieredImageNet (Tables 3 and 4), where we consistently improve over the current SOTA methods by 0.7-2.2%.

For the Meta-Dataset [Triantafillou2020Meta-Dataset:], we train our model on the ILSVRC train split and test on 10 diverse datasets. Our results in Table 5 demonstrate that our method outperforms fo-Proto-MAML [Triantafillou2020Meta-Dataset:] across all 10 datasets. Even without multi-head distillation, we outperform both the simple and the distilled version of the RFS method on 6 out of 10 datasets. Overall, we compare favorably against RFS, achieving a new SOTA result on the challenging Meta-Dataset.



Dataset fo-Proto-MAML RFS (LR-Simple) RFS (LR-Distill) Ours


ILSVRC
Omniglot
Aircraft
Birds
Textures
Quick Draw
Fungi
VGG Flower
Traffic Signs
MSCOCO
Mean Accuracy


Table 5: Results on Meta-Dataset. Average accuracy (%) is reported with variable number of ways and shots, following the setup in [Triantafillou2020Meta-Dataset:]. 1000 tasks are sampled for evaluation. The top two results are shown in red and blue.

4.2 Ablations



Method miniImageNet, 5-Way CIFAR-FS, 5-Way FC100, 5-Way
1-Shot 5-Shot 1-Shot 5-Shot 1-Shot 5-Shot


Baseline Training
Ours with only Invariance
Ours with only Equivariance
Ours with Equi and Invar (W/O KD)
Ours with Supervised KD
Ours Full


Table 6: Ablation study on miniImageNet, CIFAR-FS, and FC100 datasets.

To study the contribution of the different components of our method, we conduct a thorough ablation study on three benchmark FSL datasets: miniImageNet, CIFAR-FS, and FC100 (Table 6). On these three datasets, our baseline supervised training achieves 62.02%, 71.50%, and 42.60% average accuracy on the 5-way 1-shot task, respectively, which is the same as RFS-Simple [tian2020rethink]. By enforcing invariance, we obtain improvements of 2.62%, 2%, and 3.5%, respectively. Likewise, enforcing equivariance gives improvements of 4.07%, 4.87%, and 4.13% over the baseline. We obtain even larger improvements by simultaneously optimizing for both equivariance and invariance, achieving gains of 4.8%, 5.33%, and 4.78% over the baseline supervised training. In addition, joint training gives a 1.3-3.3% improvement over invariance-only training and a 0.5-0.7% improvement over equivariance-only training. We observe similar trends for the 5-way 5-shot task. This consistent improvement across all datasets and both tasks empirically validates our claim that joint optimization for both equivariance and invariance is beneficial for FSL. Our ablation study also shows that multi-head distillation improves performance over standard logit-level supervised distillation across all datasets.

Effect of the number of Transformations: To investigate the effect of the total number of applied transformations, we perform an ablation study on the CIFAR-FS validation set by varying the number of transformations, $M$. We present the results in Table 7, which shows that the performance of our method initially improves with increasing $M$. However, the performance starts to saturate beyond a particular point. We hypothesize that performance eventually decreases with an increasing number of transformations because discriminating a larger number of transformations is more difficult, and the model spends more of its representation capacity on solving this harder task. A similar trend is observed in [gidaris2018unsupervised], where increasing the number of recognizable rotations does not lead to better performance. Based on the results in Table 7, we set $M$ to 16 for all of our experiments and do not tune this value from dataset to dataset.



$M$ Description 1-Shot 5-Shot


3 Aspect-Ratio
4 Rotation
8 Rotation, Scale
12 Aspect-Ratio, Rotation
16 Aspect-Ratio, Rotation, Scale
20 Aspect-Ratio, Rotation, Scale


Table 7: Ablation study on the CIFAR-FS validation set with different values of $M$. We choose $M = 16$ for all the experiments.
Figure 3: t-SNE visualization of features for 1000 randomly sampled images from 5 randomly selected test classes of miniImageNet dataset. In our case, the learned embeddings provide better discrimination for unseen test classes.

4.3 Analysis

We perform a t-SNE visualization of the output embeddings from $f_\Theta$ for the test images of miniImageNet to demonstrate the effectiveness of our method (see Fig. 3). We observe that the base learner trained in a supervised manner can retain good class discrimination even for unseen test classes. However, as evident in Fig. 3, the class boundaries are not precise and compact. Enforcing invariance on top of the base learner leads to more compact class boundaries; however, the sample embeddings of different classes are still relatively close to one another. On the other hand, enforcing equivariance leads to class representations that are well spread out, since it retains the transformation-equivariant information in the embedding space. Finally, our proposed method takes advantage of both of these complementary properties and generates embeddings that lead to more compact clusters and discriminative class boundaries.
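As a sketch of how such a visualization can be produced with scikit-learn (the sampling numbers follow the figure caption; everything else, such as the perplexity, is an assumed default):

```python
import torch
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

@torch.no_grad()
def tsne_plot(encoder, images, labels, out_path="tsne.png"):
    """2-D t-SNE of frozen encoder embeddings, colored by (unseen) class label."""
    z = encoder(images).cpu().numpy()                           # (N, d) embeddings
    z2 = TSNE(n_components=2, init="pca", perplexity=30).fit_transform(z)
    plt.figure(figsize=(5, 5))
    plt.scatter(z2[:, 0], z2[:, 1], c=labels.numpy(), s=5, cmap="tab10")
    plt.axis("off")
    plt.savefig(out_path, bbox_inches="tight")
```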

4.4 Alternate Self-Supervision Losses



Method 1-Shot 5-Shot


Baseline Training
Baseline + Jigsaw Puzzle [noroozi2016unsupervised]
Baseline + Location Pred [sun2019unsupervised]
Baseline + Context Pred [doersch2015unsupervised]
Baseline + Rotation [gidaris2018unsupervised]
Ours (W/O KD)


Table 8: FSL with different SSL objectives on miniImageNet dataset.

In Table 8, to further analyze the performance improvement of our method, we conduct a set of experiments where commonly used self-supervised objectives, such as solving jigsaw puzzles [noroozi2016unsupervised], patch location prediction [sun2019unsupervised], context prediction [doersch2015unsupervised], and rotation classification [gidaris2018unsupervised], are added on top of the base learner as auxiliary tasks. We find that our proposed method, which learns representations retaining both transformation-invariant and equivariant information, outperforms all of these SSL tasks by a good margin. We also notice that the patch-based SSL tasks [noroozi2016unsupervised, sun2019unsupervised, doersch2015unsupervised] generally underperform compared to SSL tasks that change the global statistics of the image while maintaining the local statistics; this conclusion is in line with the experimental results of [gidaris2019boosting].

5 Conclusion

In this work, we explored a set of inductive biases that help learn highly discriminative and transferable representations for FSL. Specifically, we showed that simultaneously learning representations that are equivariant and invariant to a set of generic transformations retains a complementary set of features that work well for novel classes. We also designed a novel multi-head knowledge distillation objective that delivers additional gains. We conducted extensive ablations to empirically validate our claim that joint optimization for invariance and equivariance leads to more generic and transferable features. We obtained new state-of-the-art results on four popular benchmark FSL datasets as well as on the newly proposed and challenging Meta-Dataset.

References

Appendix A Supplementary Materials Overview

In the supplementary materials we include the following: additional details about the applied geometric transformations (Section B), additional results with the transformations sampled from the complete space of affine transformations (Section C), ablation study on the coefficient of inductive loss (Section D), ablation study on the temperature of knowledge distillation (Section E), effect of successive self knowledge distillation (Section F), additional results on Meta-Dataset with multi-head distillation (Section G), and effect of enforcing invariance and equivariance for supervised classification (Section H).

Appendix B Geometric Transformations

For our geometric transformations, we sample from the complete space of similarity transformations and use four rotations: {0°, 90°, 180°, 270°}, two scalings: {0.67, 1.0}, and three aspect-ratio transformations: {0.67, 1.0, 1.33}. Different combinations of these transformations lead to different values of $M$ (the total number of applied transformations). An ablation study on the value of $M$ is included in Section 4.2 of the main paper. In Table 9, we include the complete description of the different values of $M$ that we use in our experiments.



$M$ Description

3 AR:{0.67, 1.0, 1.33}
4 ROT:{0°, 90°, 180°, 270°}
8 ROT:{0°, 90°, 180°, 270°} × S:{0.67, 1.0}
12 AR:{0.67, 1.0, 1.33} × ROT:{0°, 90°, 180°, 270°}
16 (AR:{0.67, 1.0, 1.33} × ROT:{0°, 90°, 180°, 270°}) ∪ (ROT:{0°, 90°, 180°, 270°} × S:{0.67})
20 (AR:{0.67, 1.0, 1.33} × ROT:{0°, 90°, 180°, 270°}) ∪ (ROT:{0°, 90°, 180°, 270°} × S:{0.67} × AR:{0.67, 1.33})
24 AR:{0.67, 1.0, 1.33} × ROT:{0°, 90°, 180°, 270°} × S:{0.67, 1.0}

Table 9: Complete description of the different values of $M$ based on different combinations of aspect-ratio (AR), rotation (ROT), and scaling (S) transformations.
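As an illustration, the short sketch below enumerates the $M = 16$ configuration from Table 9 (the union of AR × ROT with ROT × S = {0.67}); the tuple encoding and function name are assumptions for exposition.

```python
from itertools import product

def transformation_set_16():
    """M = 16: (AR x ROT) union (ROT x S={0.67}), each entry encoded as
    (rotation_deg, scale, aspect_ratio)."""
    rotations = (0, 90, 180, 270)
    combos = {(rot, 1.0, ar) for ar, rot in product((0.67, 1.0, 1.33), rotations)}
    combos |= {(rot, 0.67, 1.0) for rot in rotations}
    return sorted(combos)

assert len(transformation_set_16()) == 16
```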

Appendix C Additional Results with Affine Transformations

We perform a set of experiments where the objective is to sample geometric transformations from the complete space of affine transformations. To this end, we quantize the affine transformation space according to Table 10. This leads to 972 distinct geometric transformations. Since it is not feasible to apply all 972 transformations to an input image to obtain the input tensor $\tilde{\mathbf{x}}$, we randomly sample 10 geometric transformations from the set of 972 and apply them to an input image to generate $\tilde{\mathbf{x}}$. The results of these experiments are presented in Table 11. From Table 11, it is evident that training with either invariance or equivariance improves over the baseline training for both 1-shot and 5-shot tasks (a 2.5-3.7% improvement). Joint optimization for both invariance and equivariance provides an additional improvement of 1%. Even though the experiments with geometric transformations sampled from the complete affine transformation space do not improve over training with $M = 16$ (the description of $M = 16$ is available in Table 9), they demonstrate consistent improvement when both invariance and equivariance are enforced simultaneously. This provides additional support for our claim that enforcing both invariance and equivariance is beneficial for learning good general representations for solving challenging FSL tasks.
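The following sketch enumerates the quantized affine space of Table 10 (4 × 3 × 3 × 3 × 3 × 3 = 972 combinations, assuming the negative values implied by the symmetric ranges) and randomly samples 10 transformations per image; the names are illustrative.

```python
import random
from itertools import product

# Quantized affine parameters (rotation, tx, ty, scale, aspect ratio, shear),
# following Table 10; the signs on translation/shear are assumed symmetric.
ROTATIONS = (0, 90, 180, 270)
TRANSLATIONS = (-0.2, 0.0, 0.2)
SCALES = (0.67, 1.0, 1.33)
ASPECT_RATIOS = (0.67, 1.0, 1.33)
SHEARS = (-20, 0, 20)

AFFINE_SPACE = list(product(ROTATIONS, TRANSLATIONS, TRANSLATIONS,
                            SCALES, ASPECT_RATIOS, SHEARS))
assert len(AFFINE_SPACE) == 972

def sample_affine_transforms(k=10, seed=None):
    """Randomly pick k of the 972 quantized affine transformations."""
    rng = random.Random(seed)
    return rng.sample(AFFINE_SPACE, k)
```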



Transformation Quantized Values


Rotation {0°, 90°, 180°, 270°}
Translation(X) {−0.2, 0.0, 0.2}
Translation(Y) {−0.2, 0.0, 0.2}
Scale {0.67, 1.0, 1.33}
Aspect-Ratio {0.67, 1.0, 1.33}
Shear {−20°, 0°, 20°}


Table 10: Quantization of the space of Affine transformations.


Method 1-Shot 5-Shot


Baseline Training
Ours with only Invar (affine)
Ours with only Equi (affine)
Ours with Equi and Invar (affine)
Ours with Equi and Invar ($M$=16)


Table 11: Average 5-way few-shot classification accuracy with 95% confidence intervals on miniImageNet dataset; trained with different geometric transformations.

Appendix D Ablation Study for Coefficient of Inductive Loss

We conduct an ablation study to measure the effect of different values of the coefficient of the inductive loss (without multi-head distillation) on the CIFAR-FS [bertinetto2018metalearning] validation set; the results on 5-way 1-shot FSL tasks are presented in Fig. 4. From Fig. 4, it is evident that the proposed method is fairly robust to different values of the coefficient of the inductive loss. However, the best performance is obtained when we set the loss coefficient to 1.0. Based on this ablation study, we use a loss coefficient of 1.0 for the inductive loss in all of our experiments.

Figure 4: Ablation study on CIFAR-FS validation set with different coefficients of the inductive loss (W/O KD); the reported score is average 5-way 1-shot classification accuracy with 95% confidence intervals.

Appendix E Ablation Study for Knowledge Distillation Temperature

To analyze the effect of the knowledge distillation temperature (used in the Kullback-Leibler (KL) divergence losses), we conduct an ablation study on the validation set of the CIFAR-FS [bertinetto2018metalearning] dataset. From Fig. 5, we observe that the proposed method with the multi-head distillation objective is not very sensitive to the distillation temperature. The proposed method achieves similar performance on the CIFAR-FS validation set when the distillation temperature is set to 4.0 or 5.0. Based on this ablation study, and to be consistent with [tian2020rethink], we set the knowledge distillation temperature to 4.0 in all of our experiments.

Figure 5: Ablation study on CIFAR-FS validation set with different values of knowledge distillation temperature; the reported score is average 5-way 1-shot classification accuracy with 95% confidence intervals.

Appendix F Effect of Successive Distillation

In all of our experiments, we use only one stage of multi-head knowledge distillation. To further investigate the effect of knowledge distillation, we perform multiple stages of self knowledge distillation on the CIFAR-FS [bertinetto2018metalearning] dataset. The results are presented in Fig. 6. Here, distillation stage 0 is the base learner trained with only the supervised baseline loss ($\mathcal{L}_{base}$), the equivariant loss ($\mathcal{L}_e$), and the invariant loss ($\mathcal{L}_i$). From Fig. 6, we observe that the performance on the FSL task improves for the first 2 stages of distillation, after which the performance saturates. Moreover, the improvement from stage 1 to stage 2 is minimal (~0.1%). Therefore, to make the proposed method more computationally efficient, we perform only one stage of distillation in all of our experiments.

Figure 6: Evaluation of different knowledge distillation stages on CIFAR-FS dataset; the reported score is average 5-way 1-shot classification accuracy with 95% confidence intervals.

Appendix G Additional Results on Meta-Dataset

In Table 12, we provide additional results on Meta-Dataset [Triantafillou2020Meta-Dataset:], which demonstrate that our proposed method can provide an additional improvement (0.5%) with the proposed multi-head knowledge distillation objective. We also outperform the distilled version of the current state-of-the-art method RFS [tian2020rethink] by 0.6% in terms of mean accuracy across all 10 datasets. Considering the challenging nature of Meta-Dataset, this improvement is significant.



Dataset RFS-Distill Ours Ours-Distill


ILSVRC
Omniglot
Aircraft
Birds
Textures
Quick Draw
Fungi
VGG Flower
Traffic Signs
MSCOCO
Mean Accuracy


Table 12: Results on Meta-Dataset. Average accuracy (%) is reported with variable number of ways and shots, following the setup in [Triantafillou2020Meta-Dataset:]. 1000 tasks are sampled for evaluation.

Appendix H Invariance and Equivariance for Supervised Classification

To demonstrate the effectiveness of the complementary strengths of invariant and equivariant representations, we conduct fully supervised classification experiments on the benchmark CIFAR-100 dataset [CIFAR-100]. For these experiments, we use the standard Wide-ResNet-28-10 [BMVC2016_87] architecture as the backbone. For training, we use an SGD optimizer with an initial learning rate of 0.1. We set the momentum to 0.9 and use a weight decay of 5e-4. For all experiments, training is performed for 200 epochs, where the learning rate is decayed by a factor of 5 at epochs 60, 120, and 160. We use a batch size of 128 for all experiments, as well as a dropout rate of 0.3. The training augmentations include the standard data augmentations: random crop and random horizontal flip. For enforcing invariance and equivariance, we set the value of $M$ to 12 for computational efficiency; the description of $M = 12$ is available in Table 9. We do not perform knowledge distillation for these experiments. The results of these experiments are presented in Table 13.

From Table 13, we notice that enforcing invariance provides little improvement (0.2%) over the supervised baseline. This is expected since the train and test data come from the same distribution and the same set of classes; making the class boundaries compact (for seen classes) does not provide much additional benefit. However, in the case of FSL, we observe that enforcing invariance over the baseline provides improvements of 2.62%, 2%, and 3.5% for the miniImageNet [NIPS2016_6385], CIFAR-FS [bertinetto2018metalearning], and FC100 [oreshkin2018tadam] datasets, respectively (Section 4.2 of the main text). On the other hand, enforcing equivariance for supervised classification provides a better improvement (1.8%), since it helps the model better learn the structure of the data. Even though enforcing equivariance provides a noticeable improvement for supervised classification, in the case of FSL we obtain much bigger improvements of 4.07%, 4.87%, and 4.13% for the miniImageNet [NIPS2016_6385], CIFAR-FS [bertinetto2018metalearning], and FC100 [oreshkin2018tadam] datasets, respectively (Section 4.2 of the main text). Finally, joint optimization for both invariance and equivariance achieves the best performance, providing a minimal but consistent improvement of 0.1% over enforcing only equivariance. However, joint optimization provides a much larger improvement on FSL tasks (see Section 4.2 of the main text). From these experiments, we conclude that although enforcing both invariance and equivariance is beneficial for supervised classification, injecting these inductive biases is more crucial for FSL tasks, since the inductive inference for FSL is more challenging (inference on unseen/novel classes).



Method Error Rate (%)



Supervised Baseline
Ours with only Invariance
Ours with only Equivariance
Ours with Equi and Invar (W/O KD)


Table 13: Results with invariance and equivariance for supervised classification on CIFAR-100 dataset.