Bridging Few-Shot Learning and Adaptation: New Challenges of Support-Query Shift

05/25/2021 ∙ by Etienne Bennequin, et al.

Few-Shot Learning (FSL) algorithms have made substantial progress in learning novel concepts with just a handful of labelled data. To classify query instances from novel classes encountered at test-time, they only require a support set composed of a few labelled samples. FSL benchmarks commonly assume that those queries come from the same distribution as instances in the support set. However, in a realistic setting, data distribution is plausibly subject to change, a situation referred to as Distribution Shift (DS). The present work addresses the new and challenging problem of Few-Shot Learning under Support/Query Shift (FSQS) i.e., when support and query instances are sampled from related but different distributions. Our contributions are the following. First, we release a testbed for FSQS, including datasets, relevant baselines and a protocol for a rigorous and reproducible evaluation. Second, we observe that well-established FSL algorithms unsurprisingly suffer from a considerable drop in accuracy when facing FSQS, stressing the significance of our study. Finally, we show that transductive algorithms can limit the inopportune effect of DS. In particular, we study both the role of Batch-Normalization and Optimal Transport (OT) in aligning distributions, bridging Unsupervised Domain Adaptation with FSL. This results in a new method that efficiently combines OT with the celebrated Prototypical Networks. We bring compelling experiments demonstrating the advantage of our method. Our work opens an exciting line of research by providing a testbed and strong baselines. Our code is available at







1 Introduction

In the last few years, we have witnessed outstanding progress in supervised deep learning

[krizhevsky2012imagenet, he2016deep]. As the abundance of labelled data during training is rarely encountered in practice, ground-breaking works in Few-Shot Learning (FSL) have emerged [Vinyals16, Snell17, Finn17], particularly for image classification. This paradigm relies on a straightforward setting. At test-time, given a set of classes unseen during training and few (typically 1 to 5) labelled examples for each one of those classes, the task is to classify query samples among them. We usually call the set of labelled samples the support set, and the set of query samples the query set. Well-adopted FSL benchmarks [Vinyals16, ren2018meta, triantafillou2019meta] commonly sample the support and query sets from the same distribution. We stress that this assumption does not hold in most use cases. When deployed in the real world, we expect an algorithm to infer on data that may shift, whether because an acquisition system deteriorates, lighting conditions vary, or real-world objects evolve [amodei2016concrete].

(a) Standard FSL
(b) FSL under Support / Query Shift
Figure 1: Illustration of the FSQS problem with a 5-way 1-shot classification task sampled from the miniImageNet dataset [Vinyals16]. In (a), a standard FSL setting where support and query sets are sampled from the same distribution. In (b), the same task but with shot-noise and contrast perturbations from [hendrycks2018benchmarking] applied on support and query sets (respectively) that results in a support-query shift. In the latter case, a similarity measure based on the Euclidean metric [Snell17] may become inadequate.

The situation of Distribution Shift (DS) i.e., when training and testing distributions differ, is ubiquitous and has dramatic effects on deep models [hendrycks2018benchmarking], motivating works in Unsupervised Domain Adaptation [pan2009survey], Domain Generalization [gulrajani2021in] and Test-Time Adaptation [wang2020fully]. However, the state of the art brings insufficient knowledge on few-shot learners' behaviour when facing distribution shift. Some pioneering works demonstrate that advanced FSL algorithms do not handle cross-domain generalization better than more naive approaches [Chen19]. Despite its great practical interest, FSL under distribution shift between the support and query sets is an under-investigated problem that has only very recently attracted attention [du2021metanorm]. We refer to it as Few-Shot Learning under Support/Query Shift (FSQS) and provide an illustration in Figure 1. It reflects a more realistic situation where the algorithm is fed with a support set at the time of deployment and infers continuously on data subject to shift. A first solution is to re-acquire a support set that follows the data's evolution. Nevertheless, this implies human intervention to select and annotate data in order to update an already deployed model, reacting to a potential drop in performance. A second solution consists in designing an algorithm that is robust to the distribution shift encountered during inference. This is the subject of the present work. Our contributions are:

  1. FewShiftBed: a testbed for FSQS. The testbed includes 3 challenging benchmarks, along with a protocol for fair and rigorous comparison across methods, an implementation of relevant baselines, and an interface to facilitate the implementation of new methods.

  2. We conduct extensive experimentation of a representative set of few-shot algorithms. We empirically show that Transductive Batch-Normalization [bronskill2020tasknorm] mitigates an important part of the inopportune effect of FSQS.

  3. We bridge Unsupervised Domain Adaptation (UDA) with FSL to address FSQS. We introduce Transported Prototypes, an efficient transductive algorithm that couples Optimal Transport (OT) [peyre2019computational] with the celebrated Prototypical Networks [Snell17]. The use of OT follows a long-standing history in UDA for aligning representations between distributions [ben2007analysis, ganin2015unsupervised]. Our experiments demonstrate that OT shows a remarkable ability to perform this alignment even with only a few samples to compare distributions, and provides a simple but strong baseline.

In Section 2 we provide a formal statement of FSQS, and we position this new problem among existing learning paradigms. In Section 3, we present FewShiftBed. We detail the datasets, the chosen baselines, and a protocol that guarantees a rigorous and reproducible evaluation. In Section 4, we present a method that couples Optimal Transport with Prototypical Networks [Snell17]. In Section 5, we conduct an extensive evaluation of baselines and our proposed method using the testbed. Finally, we present in Section 6 the related works, while in Section 7 we draw perspectives of improvement and interesting research directions.

2 The Support-Query Shift problem

2.1 Statement


We consider an input space X, a representation space Z ⊂ R^p and a set of classes C. A representation is a learnable function from X to Z, noted φ_θ, with θ a set of parameters. A dataset Δ is defined by a set of classes C and a set of domains D, where a domain d ∈ D is a set of IID realizations from a distribution noted p_d. For two domains d, d', the distribution shift is characterized by p_d ≠ p_{d'}. For instance, if the data consists of images of letters handwritten by several users, a domain d can consist of the samples from a specific user. Referring to the well-known UDA terminology of source / target [pan2009survey], we define a couple of source-target domains as a couple (d_s, d_t) with p_{d_s} ≠ p_{d_t}, thus presenting a distribution shift. Additionally, given a domain d and a subset of classes C' ⊂ C, the restriction of d to images with a label that belongs to C' is noted d|_{C'}.

Figure 2: During meta-learning (Train-Time), each episode contains a support and a query set sampled from different distributions (for instance, illustrated by noise and contrast as in Figure 1(b)) of a set of training domains (D_train), reflecting a situation that may potentially occur at test-time. When deployed, the FSL algorithm, using a trained backbone, is fed with a support set sampled from new classes. As the algorithm is expected to infer continuously on data subject to shift (Test-Time), we evaluate it on data with an unknown shift (a couple of domains from D_test). Importantly, both the classes (C_test) and the shifts (D_test) are not seen during training, making FSQS a challenging generalization problem.

Dataset splits.

We build a split of Δ by splitting C (respectively D) into C_train and C_test (respectively D_train and D_test) such that C_train ∩ C_test = ∅ (respectively D_train ∩ D_test = ∅). This gives us a train/test split with the datasets Δ_train (classes C_train, domains D_train) and Δ_test (classes C_test, domains D_test). By extension, we build a validation set Δ_val following the same protocol.

Few-Shot Learning under Support-Query Shift (FSQS).


An episode of FSQS is defined by:

  • the dataset Δ_test, with its classes C_test and domains D_test;

  • a couple of source-target domains (d_s, d_t) from D_test;

  • a set of classes C_ep ⊂ C_test;

  • a small labelled support set S (named source support set) such that for all (x, y) ∈ S, y ∈ C_ep, and x is an instance of d_s, i.e., x ∈ d_s|_{C_ep};

  • an unlabelled query set Q (named target query set) such that for all x ∈ Q, x is an instance of d_t, i.e., x ∈ d_t|_{C_ep}.

The task is to predict the labels of the query-set instances of d_t|_{C_ep}. When |C_ep| = n and the support set contains k labelled instances for each class, this is called an n-way k-shot FSQS classification task. Note that this paradigm provides an additional challenge compared to classical few-shot classification tasks since, at test time, the model is expected to generalize to both new classes and new domains while the support and query sets are sampled from different distributions. As presented above, in order to evaluate the model's capacity to generalize to both new classes and new domains, we split the dataset into train, validation and test sets (respectively named Δ_train, Δ_val and Δ_test), controlling that there is no overlap of either classes or domains between these sets. The model's parameters are trained on Δ_train and selected on Δ_val. Finally, the model is tested on Δ_test. This paradigm is illustrated in Figure 2.

Episodic training.

We build an episode by sampling a set of classes C_ep ⊂ C_train and a couple of source and target domains (d_s, d_t) from D_train. We build a support set S of labelled instances from the source domain d_s and a query set Q of instances from the target domain d_t, such that both sets only contain instances with labels in C_ep. Using the labelled examples from S and the unlabelled instances from Q, the model is expected to predict the labels of Q. The parameters of the model are then trained using a cross-entropy loss between the predicted labels and the ground-truth labels of the query set.
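The episodic sampling described above can be sketched as follows (a minimal sketch; the nested `dataset[domain][cls]` layout, the function name and the helpers are hypothetical, not FewShiftBed's actual API):

```python
import random

def sample_episode(dataset, n_way, k_shot, n_query, rng=random):
    """Sample one FSQS episode: support and query share the same classes
    but come from two distinct domains.

    `dataset` is a hypothetical nested mapping: dataset[domain][cls] -> list of items.
    """
    domains = list(dataset)
    source, target = rng.sample(domains, 2)
    # Only classes available in both sampled domains can be used.
    common = sorted(set(dataset[source]) & set(dataset[target]))
    classes = rng.sample(common, n_way)
    support = [(x, c) for c in classes
               for x in rng.sample(dataset[source][c], k_shot)]
    query = [(x, c) for c in classes
             for x in rng.sample(dataset[target][c], n_query)]
    return support, query
```

At train-time both domains come from D_train; at test-time the same sampler would draw from D_test, on classes never seen during training.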

2.2 Positioning

| SQ problem | Support size | Query size | Transductivity | New classes | New domains | SQS |
|---|---|---|---|---|---|---|
| FSL [Snell17, Finn17] | Few | Few | Point-wise | ✓ | ✗ | No |
| TransFSL [ren2018meta, liu2018learning, antoniou2019learning] | Few | Few | Small | ✓ | ✗ | No |
| CDFSL [Chen19] | Few | Few | Point-wise | ✓ | ✓ | No |
| UDA [quionero2009dataset, pan2009survey] | Large | Large | Large | ✗ | ✓ | Yes |
| TTA [sun2020test, schneider2020improving, wang2020fully] | Large | Batch | Small | ✗ | ✓ | Yes |
| ARM [zhang2020adaptive] | Large | Few | Small | ✗ | ✓ | Yes |
| Ind FSQS (ours) | Few | Few | Point-wise | ✓ | ✓ | Yes |
| Trans FSQS (ours) | Few | Few | Small | ✓ | ✓ | Yes |

Table 1: An overview of SQ problems. We divide SQ problems into two categories, according to the presence or absence of Support-Query Shift: No SQS vs SQS. We consider three classes of transductivity: point-wise transductivity, which is equivalent to inductive inference; small transductivity, when inference is performed at the batch level (typically in [wang2020fully, zhang2020adaptive]); and large transductivity, when inference is performed at the dataset level (typically in UDA). New classes (resp. new domains) indicates whether the model is evaluated at test-time on novel classes (resp. novel domains). Note that we frame UDA as a fully test-time algorithm. Notably, Cross-Domain FSL (CDFSL) [Chen19] assumes that the support set and query set are drawn from the same distribution, hence No SQS.

To highlight FSQS's novelty, our discussion revolves around the problem of inferring on a given query set provided with the knowledge of a support set. We refer to this class of problems as SQ problems. Intrinsically, FSL falls into the category of SQ problems. Interestingly, Unsupervised Domain Adaptation (UDA) [pan2009survey], defined as labelling a dataset sampled from a target domain based on labelled data sampled from a source domain, is also an SQ problem: the source domain plays the role of the support, while the target domain plays the role of the query. Notably, an essential line of study in UDA leverages the target data distribution for aligning source and target domains, reflecting the importance of transduction in a context of adaptation [ben2007analysis, ganin2015unsupervised], i.e., performing prediction by considering all target samples together. Transductive algorithms also have a special place in FSL [dhillon2019baseline, liu2018learning, ren2018meta] and show that leveraging the query set as a whole brings a significant boost in performance. Nevertheless, UDA and FSL exhibit fundamental differences. UDA addresses the problem of distribution shift using large source and target datasets (typically thousands of instances) to align distributions. In contrast, FSL focuses on the difficulty of learning from few samples. We therefore frame UDA as an SQ problem with both large transductivity and support-query shift, while FSL is an SQ problem without support-query shift, possibly with small transductivity in the case of transductive FSL. Thus, FSQS combines both challenges: distribution shift and small transductivity. This new perspective allows us to establish fruitful connections with related learning paradigms, presented in Table 1, which we review in the following. A thorough review is available in Appendix A.


We review the UDA works that are most related to our problem. Inspired by the principle of invariant representations [ben2007analysis, ganin2015unsupervised], the seminal work [courty2016optimal] brought Optimal Transport [peyre2019computational] as an efficient framework for aligning data distributions. UDA requires a whole target dataset for inference, limiting its applications. Recent pioneering works, referred to as Test-Time Adaptation (TTA), adapt a model at test-time provided with a batch of samples from the target distribution. The proposed methodologies are test-time training by self-supervision [sun2020test], updating batch-normalization statistics [schneider2020improving] or parameters [wang2020fully], or meta-learning to condition predictions on the whole batch of test samples for Adaptive Risk Minimization (ARM) [zhang2020adaptive].

Few-Shot Classification.

We usually frame Few-Shot Classification methods [Chen19] as metric-based methods [Vinyals16, Snell17], optimization-based methods that learn to fine-tune by adapting with few gradient steps [Finn17], or hallucination-based methods that augment the rare labelled data [hariharan2017low]. A promising line of study leverages transductivity (using the query set as unlabelled data, while inductive methods predict individually on each query sample) and brings semi-supervised principles to FSL. Transductive Propagation Network [liu2018learning] meta-learns label propagation from the support to the query set concurrently with the feature extractor. Transductive Fine-Tuning [dhillon2019baseline] minimizes the prediction entropy of all query instances during fine-tuning. At last, Ren et al. [ren2018meta] use the query set to improve the estimation of prototypes. Evaluating cross-domain generalization of FSL (CDFSL), i.e., a distributional shift between meta-training and meta-testing, has attracted the attention of a few recent works [Chen19]. In particular, naive approaches perform on par with, or even better than, advanced FSL methods, as shown in [Chen19]. Zhao et al. propose a Domain-Adversarial Prototypical Network [zhao2020domain] to both align source and target domains in the feature space and maintain discriminativeness between classes. Sahoo et al. combine Prototypical Networks with adversarial domain adaptation at the task level [sahoo2019meta]. Notably, Cross-Domain Few-Shot Learning [Chen19] addresses the distributional shift between meta-training and meta-testing while assuming that the support and query sets are drawn from the same distribution, so it is an SQ problem without support-query shift.

3 FewShiftBed: A Pytorch testbed for FSQS

3.1 Datasets

We designed three new image classification datasets adapted to the FSQS problem. These datasets have two specificities.

  1. They can be divided into groups of images, assuming that each group corresponds to a distinct domain. A key challenge is that each group must contain enough images, with a sufficient variety of class labels, so that it is possible to sample FSQS episodes.

  2. They are delivered with a train/val/test split along both the class and the domain axes. This means that Δ_train (respectively Δ_val, Δ_test) contains images from any domain in D_train (respectively D_val, D_test) that have their labels in C_train (respectively C_val, C_test), with the three sets of domains pairwise disjoint, and likewise for the three sets of classes. Therefore, these datasets provide true few-shot tasks at test time, in the sense that the model will not have seen any instance of the test classes and domains during training. Furthermore, we split the classes with respect to intuitive semantic similarity, e.g. the training classes are intuitively closer to each other than they are to classes in the testing set. Note that since we split along two axes, some data may be discarded (for instance, images from a domain in D_train with a label in C_test). Therefore it is crucial to find a split that minimizes this loss of data.
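The two-axis split, and the data loss it causes, can be illustrated with a small helper (an illustrative sketch, not part of FewShiftBed): a record is kept only when its class and its domain fall in the same split.

```python
def two_axis_split(items, class_split, domain_split):
    """Assign each (domain, cls, item) record to train/val/test only when
    its class split and its domain split agree; everything else is discarded.

    `class_split` and `domain_split` are hypothetical mappings from class or
    domain name to one of 'train', 'val', 'test'.
    """
    out = {"train": [], "val": [], "test": []}
    discarded = []
    for domain, cls, item in items:
        if class_split[cls] == domain_split[domain]:
            out[class_split[cls]].append((domain, cls, item))
        else:
            # Class and domain belong to different splits: drop the record.
            discarded.append((domain, cls, item))
    return out, discarded
```

With two domains and two classes split evenly, half of the records end up discarded, which is why a split minimizing this loss matters.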

Meta-CIFAR100-Corrupted (MC100-C).

CIFAR-100 [krizhevsky2009learning] is a dataset of 60,000 three-channel square images of size 32×32, evenly distributed in 100 classes. Classes are evenly distributed in 20 superclasses. We use the same method used to build CIFAR-10-C [hendrycks2018benchmarking], which applies 19 image perturbations, each at 5 different levels of intensity, to evaluate the robustness of a model to domain shift. We modify their protocol to adapt it to the FSQS problem: (i) we split the classes with respect to the superclass structure, and assign 13 superclasses (65 classes) to the training set, 2 superclasses (10 classes) to the validation set, and 5 superclasses (25 classes) to the testing set; (ii) we also split image perturbations (acting as domains), following the split of [zhang2020adaptive]. As a result, a model using this benchmark is trained on images whose domains and labels are disjoint from both the validation domains and labels, and the testing domains and labels. We obtain 2,184k transformed images for training, 114k for validation and 330k for testing. The detailed split is available in the documentation of our code repository.

miniImageNet-Corrupted (mIN-C).

miniImageNet [Vinyals16] is a popular benchmark for few-shot image classification. It contains 60k images from 100 classes of the ImageNet dataset. 64 classes are assigned to the training set, 16 to the validation set and 20 to the test set. Like MC100-C, we build mIN-C using the image perturbations proposed by [hendrycks2018benchmarking] to simulate different domains. We use the original class split from [Vinyals16], and the same domain split as for MC100-C. Although the original miniImageNet uses 84×84 images, we use larger images, which allows us to re-use the perturbation parameters calibrated in [hendrycks2018benchmarking] for ImageNet. Finally, we discard the 5 most time-consuming perturbations. We obtain a total of 1.2M transformed images for training, 182k for validation and 228k for testing. The detailed split is available in the documentation of our code repository.


Federated-EMNIST Few-Shot (FEMNIST-FS).

EMNIST [cohen2017emnist] is a dataset of images of handwritten digits and uppercase and lowercase characters. Federated-EMNIST [caldas2018leaf] is a version of EMNIST where images are sorted by writer (or user). FEMNIST-FS consists of a split of the FEMNIST dataset adapted to few-shot classification. We separate both users and classes between the training, validation and test sets. We build each group as the set of images written by one user. The detailed split is available in the code. Note that in FEMNIST, many users provide several instances of each digit, but fewer than two instances of most letters. It is therefore hard to find enough samples from one user to build a support set or a query set. As a result, our experiments are limited to classification tasks with only one sample per class in both the support and query sets.

3.2 Algorithms

We implement in FewShiftBed two representative methods of the vast FSL literature that are commonly considered as strong baselines: Prototypical Networks (ProtoNet) [Snell17] and Matching Networks (MatchingNet) [Vinyals16]. Besides, for transductive FSL, we also implement Transductive Propagation Network (TransPropNet) [liu2018learning] and Transductive Fine-Tuning (FTNet) [dhillon2019baseline]. We also implement our novel algorithm, Transported Prototypes (TP), which is detailed in Section 4. FewShiftBed is designed to favor a straightforward implementation of a new algorithm for FSQS. To add a new algorithm, we only need to implement the set_forward method of the class AbstractMetaLearner. We provide an example with our implementation of the Prototypical Network [Snell17], which only requires a few lines of code:

```python
class ProtoNet(AbstractMetaLearner):
    def set_forward(self, support_images, support_labels, query_images):
        z_support, z_query = self.extract_features(support_images, query_images)
        z_proto = self.get_prototypes(z_support, support_labels)
        return -euclidean_dist(z_query, z_proto)
```

3.3 Protocol

To prevent the pitfall of misinterpreting a performance boost, we draw three recommendations to isolate the causes of improvement rigorously.

How important is episodic training?

Episodic training has been, notably through its elegance, responsible for the success and wide adoption of meta-learning for FSL. Nevertheless, in some situations, episodic training does not perform better than more naive approaches [Chen19], including a backbone trained by standard Empirical Risk Minimization (ERM). Therefore, we recommend reporting results obtained with both episodic training and standard ERM (see the documentation of our code repository).

How does the algorithm behave in the absence of Support-Query Shift?

An algorithm that specifically addresses the distribution-shift problem should not degrade performance in an ordinary context. To verify that this unpleasant outcome does not occur, we also recommend reporting the model's performance when no support-query shift is observed at test-time. This also provides a top-performing baseline, measuring how far the method is from the ideal case. Note that this is equivalent to evaluating performance in cross-domain generalization, as first described in [Chen19].

Is the algorithm transductive?

The assumption of transductivity has been responsible for several improvements in FSL [antoniou2019learning, ren2018meta, bronskill2020tasknorm], and it has been demonstrated in [bronskill2020tasknorm] that MAML [Finn17] benefits strongly from Transductive Batch-Normalization (TBN). Thus, we recommend specifying whether the method is transductive and choosing the batch normalization accordingly (Conventional Batch-Normalization [ioffe2015batch] for inductive methods, Transductive Batch-Normalization for transductive ones), since transductive batch normalization brings a significant boost in performance [bronskill2020tasknorm].
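As an illustration, transductive batch normalization replaces the running statistics accumulated at train-time with statistics computed on the current episode's features. A minimal NumPy sketch (omitting the learnable affine parameters of a full BatchNorm layer):

```python
import numpy as np

def transductive_batch_norm(features, eps=1e-5):
    """Normalize features using the statistics of the current
    support+query batch (transductive), rather than running statistics
    frozen at train time (conventional BN)."""
    mean = features.mean(axis=0, keepdims=True)
    var = features.var(axis=0, keepdims=True)
    return (features - mean) / np.sqrt(var + eps)
```

Because the statistics mix support and query features, this normalization is inherently transductive: each query prediction depends on the whole batch.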

4 Transported Prototypes: A baseline for FSQS

Figure 3: Overview of Transported Prototypes. (1) A support set and a query set are fed to a trained backbone that embeds images into a feature space. (2) Due to the shift between distributions, support and query instances are embedded in non-overlapping areas. (3) We compute the Optimal Transport from support instances to query instances to build the transported support set. Note that we represent the transport plan only for one instance per class to preserve clarity in the schema. (4) Provided with the transported support, we apply the Prototypical Network [Snell17] i.e., similarity between transported support and query instances.

4.1 Overall idea

We present a novel method that brings UDA to FSQS. As aforementioned, in FSQS we no longer assume that the support set and the query set are sampled from the same distribution. As a result, it is unlikely that they share the same region of the representation space (non-overlap). In particular, the Euclidean distance, adopted in the celebrated Prototypical Network [Snell17], may no longer be relevant for measuring similarity between query and support instances, as presented in Figure 1. We emphasize that discovering new couples of source and target domains at test-time may significantly accentuate this non-overlap. To overcome this issue, we develop a two-phase approach that combines Optimal Transport (Transportation Phase) with the celebrated Prototypical Network (Prototype Phase). We give some background on Optimal Transport (OT) in Section 4.2, and the whole procedure is presented in Algorithm 1.

4.2 Background


We provide some basics about Optimal Transport (OT); a thorough presentation is available in [peyre2019computational]. Let α and β be two distributions on X. We note Π(α, β) the set of joint probabilities on X × X with marginals α and β. The Optimal Transport cost between α and β, associated with a cost function c, is defined as:

    W_c(α, β) = min_{π ∈ Π(α, β)} E_{(x, x') ∼ π} [c(x, x')]    (1)

with c any metric. We note π* the joint distribution that achieves the minimum in Equation 1; it is named the transportation plan from α to β. When there is no confusion, we simply note it π. For our applications, we use as cost the Euclidean distance in the representation space obtained from a representation φ_θ, i.e., c(x, x') = ||φ_θ(x) − φ_θ(x')||_2.

Discrete OT.

When α and β are only accessible through finite sets of samples, respectively (x_i)_{1≤i≤n} and (x'_j)_{1≤j≤m}, we introduce the empirical distributions α̂ = Σ_i a_i δ_{x_i} and β̂ = Σ_j b_j δ_{x'_j}, where a_i (respectively b_j) is the mass probability put on sample x_i (respectively x'_j), i.e., Σ_i a_i = 1 (respectively Σ_j b_j = 1), and δ_x is the Dirac distribution at x. The discrete version of OT is derived by introducing the set of couplings Π(a, b) = {π ∈ R_+^{n×m} : π 1_m = a, π^T 1_n = b}, where a = (a_1, …, a_n), b = (b_1, …, b_m), and 1_n (respectively 1_m) is the unit vector of dimension n (respectively m). The discrete transportation plan π* is then defined as:

    π* = argmin_{π ∈ Π(a, b)} ⟨π, C⟩_F    (2)

where C_{ij} = c(x_i, x'_j) is the cost matrix and ⟨·, ·⟩_F is the Frobenius dot product. Note that π* depends on both α̂ and β̂, and on θ since C depends on θ. In practice, we use entropic regularization [cuturi2013sinkhorn], which makes OT easier to solve by promoting smoother transportation plans through a computationally efficient algorithm based on Sinkhorn-Knopp's scaling matrix approach [knight2008sinkhorn] (see Appendix C).
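A minimal sketch of the Sinkhorn-Knopp iterations for the entropy-regularized plan (illustrative only; in practice a dedicated library such as POT would be used, and the function name here is ours):

```python
import numpy as np

def sinkhorn_plan(cost, a, b, reg=0.1, n_iters=200):
    """Entropy-regularized OT via Sinkhorn-Knopp matrix scaling.

    `cost` is the n x m cost matrix C, `a` and `b` the source/target
    mass vectors; returns the transportation plan pi = diag(u) K diag(v).
    """
    K = np.exp(-cost / reg)              # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)                # scale columns toward marginal b
        u = a / (K @ v)                  # scale rows toward marginal a
    return u[:, None] * K * v[None, :]
```

Smaller `reg` gives plans closer to the exact OT solution at the price of slower convergence and possible numerical underflow of the kernel.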

4.3 Method

Input: support set S, query set Q, episode classes C_ep, backbone φ_θ.
Output: loss ℓ for a randomly sampled episode.

1: z_i ← φ_θ(x_i) for x_i ∈ S; z'_j ← φ_θ(x'_j) for x'_j ∈ Q    ▷ Get representations.
2: C_ij ← ||z_i − z'_j||² for all (i, j)    ▷ Cost matrix.
3: π* ← solve Equation 2    ▷ Transportation plan.
4: P_ij ← π*_ij / Σ_j π*_ij for all (i, j)    ▷ Normalization.
5: ẑ_i given by Equation 3, for all i    ▷ Get transported support set.
6: ĉ_k ← mean of {ẑ_i : y_i = k}, for k ∈ C_ep    ▷ Get transported prototypes.
7: p(y = k | x'_j) from Equation 4, for k ∈ C_ep and x'_j ∈ Q
8: Return: ℓ ← cross-entropy of the predictions on Q.
Algorithm 1 Transported Prototypes. Steps 2 to 5 highlight the OT contribution in the computational graph of an episode compared to the standard Prototypical Network [Snell17].

Transportation Phase.

At each episode, we are provided with a source support set S and a target query set Q. We note respectively (z_i)_{1≤i≤n} and (z'_j)_{1≤j≤m} their representations from a deep network φ_θ, i.e., z_i = φ_θ(x_i) for x_i ∈ S and z'_j = φ_θ(x'_j) for x'_j ∈ Q. As these two sets are sampled from different distributions, they are likely to lie in different regions of the representation space. In order to adapt the source support set to the target domain, which is only represented by the target query set Q, we follow [courty2016optimal] and compute the barycenter mapping of the support representations, which we refer to as the transported support set (ẑ_i)_{1≤i≤n}, defined as follows:

    ẑ_i = Σ_{j=1}^{m} P_ij z'_j,   with   P_ij = π*_ij / Σ_{j'=1}^{m} π*_ij'    (3)

where π* is the transportation plan from the support representations to the query representations. The transported support set is an estimation of labelled examples in the target domain built from labelled examples in the source domain. Its success relies on the fact that transportation conserves labels: a query instance close to ẑ_i, the barycenter mapping of z_i, should share the same label as x_i. See step (3) of Figure 3 for a visualization of the transportation phase.

Prototype Phase.

For each class k of the current episode's classes C_ep, we compute the mean, called the transported prototype, of the transported support features of class k:

    ĉ_k = (1 / |Ŝ_k|) Σ_{ẑ_i ∈ Ŝ_k} ẑ_i,

where Ŝ_k is the set of transported support features with class k. We classify each query x'_j, with representation z'_j, using its Euclidean distance to each transported prototype:

    p(y = k | x'_j) = exp(−||z'_j − ĉ_k||²) / Σ_{k' ∈ C_ep} exp(−||z'_j − ĉ_{k'}||²)    (4)

Crucially, the standard Prototypical Network [Snell17] computes the Euclidean distance to each prototype, while we compute the Euclidean distance to each transported prototype, as presented in step (4) of Figure 3. Note that our formulation involves the query set in the computation of the prototypes.
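Putting the two phases together, the forward pass can be sketched in NumPy as follows (an illustrative sketch with uniform mass on samples; the function name and the inlined Sinkhorn solver are ours, not FewShiftBed's API):

```python
import numpy as np

def transported_prototypes_scores(z_support, y_support, z_query, reg=0.1, n_iters=100):
    """(1) Transport support features onto the query distribution via the
    barycenter mapping, then (2) run ProtoNet on the transported support."""
    n, m = len(z_support), len(z_query)
    # Squared-Euclidean cost matrix between support and query features.
    cost = ((z_support[:, None, :] - z_query[None, :, :]) ** 2).sum(-1)
    # Sinkhorn-Knopp iterations for the entropy-regularized plan.
    K = np.exp(-cost / reg)
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    u = np.ones(n)
    for _ in range(n_iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    plan = u[:, None] * K * v[None, :]
    # Barycenter mapping: each support point becomes a weighted mean of queries.
    z_transported = (plan / plan.sum(axis=1, keepdims=True)) @ z_query
    # Transported prototypes; negative squared distances act as class scores.
    classes = np.unique(y_support)
    protos = np.stack([z_transported[y_support == c].mean(0) for c in classes])
    d2 = ((z_query[:, None, :] - protos[None, :, :]) ** 2).sum(-1)
    return classes, -d2   # softmax over -d2 gives Equation 4's probabilities
```

With a constant offset between support and query clusters, the barycenter mapping pulls the support onto the query region, after which plain nearest-prototype classification recovers the correct labels.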

Genericity of OT.

FewShiftBed implements OT as a stand-alone module that can be easily plugged into any FSL algorithm (transportation_module). We report additional baselines in Appendix B where the FSL algorithm is equipped with OT. This technical choice reflects our insight that OT may be ubiquitous for addressing FSQS and makes its usage in the testbed straightforward.

5 Experiments

| Method | Transd. | MC100-C 1-shot | MC100-C 5-shot | mIN-C 1-shot | mIN-C 5-shot | FEMNIST-FS 1-shot |
|---|---|---|---|---|---|---|
| ProtoNet [Snell17] | | 30.02 ± 0.40 | 42.77 ± 0.47 | 36.37 ± 0.50 | 47.58 ± 0.57 | 84.31 ± 0.73 |
| MatchingNet [Vinyals16] | | 30.71 ± 0.38 | 41.15 ± 0.45 | 35.26 ± 0.50 | 44.75 ± 0.55 | 84.25 ± 0.71 |
| TransPropNet [liu2018learning] | ✓ | **34.15 ± 0.39** | 47.39 ± 0.42 | 24.10 ± 0.27 | 27.24 ± 0.33 | 86.42 ± 0.76 |
| FTNet [dhillon2019baseline] | ✓ | 28.91 ± 0.37 | 37.28 ± 0.40 | 39.02 ± 0.46 | 51.27 ± 0.45 | 86.13 ± 0.71 |
| TP (ours) | ✓ | 34.00 ± 0.46 | **49.71 ± 0.47** | **40.49 ± 0.54** | **59.85 ± 0.49** | **93.63 ± 0.63** |

Table 2: Top-1 accuracy of few-shot learning models on various datasets and numbers of shots, with 8 instances per class in the query set (except for FEMNIST-FS: 1 instance per class in the query set) and 95% confidence intervals. For each setting, the best accuracy is shown in bold. "✓" flags transductive methods.

We compare the performance of baseline algorithms with that of Transported Prototypes on various datasets and settings. We also offer an ablation study in order to isolate the sources of Transported Prototypes' success. Extensive results are detailed in Appendix B. Instructions to reproduce these results can be found in the code's documentation.

Setting and details.

We conduct experiments on all methods and datasets implemented in FewShiftBed. Hyperparameters and specific implementation choices are available in Appendix B. We use a standard 4-layer convolutional network for our experiments on Meta-CIFAR100-C and FEMNIST-FS, and a ResNet18 for our experiments on miniImageNet-C. Transductive methods are equipped with a Transductive Batch-Normalization. All episodic training runs contain 40k episodes, after which we retrieve the "best" state of the model, based on the best validation accuracy. We run each individual experiment with three different random seeds. All results presented in this paper are the average accuracies obtained over these random seeds.


Table 2 reveals that Transported Prototypes (TP) outperforms all baselines on nearly all datasets and settings. Importantly, baselines perform poorly on FSQS, demonstrating that they are not equipped to address this challenging problem and stressing our study's significance. It is also interesting to note that transductive approaches, which significantly improve performance in the standard FSL setting [liu2018learning, dhillon2019baseline], perform similarly to simpler methods (ProtoNet and MatchingNet). Notably, TransPropNet [liu2018learning] fails badly without Transductive Batch-Normalization, showing that propagating labels between non-overlapping support and query sets can have a dramatic impact (see Appendix B). Thus, FSQS deserves a fresh look to be solved. Transported Prototypes mitigates a significant part of the performance drop caused by support-query shift while benefiting from the simplicity of combining a popular FSL method with a time-tested UDA method. This gives us strong hopes for future work in this direction.

Meta-CIFAR100-C miniImageNet-C FEMNIST-FS
1-shot 5-shot 1-shot 5-shot 1-shot
TP 34.00 ± 0.46 49.71 ± 0.47 40.49 ± 0.54 59.85 ± 0.49 93.63 ± 0.63
TP w/o OT 32.47 ± 0.41 48.00 ± 0.44 40.43 ± 0.49 53.71 ± 0.50 90.36 ± 0.58
TP w/o TBN 33.74 ± 0.46 49.18 ± 0.49 37.32 ± 0.55 55.16 ± 0.54 92.31 ± 0.73
TP w. OT-TT 32.81 ± 0.46 48.62 ± 0.48 44.77 ± 0.57 60.46 ± 0.49 94.92 ± 0.55
TP w/o ET 35.94 ± 0.45 48.66 ± 0.46 42.46 ± 0.53 54.67 ± 0.48 94.22 ± 0.70
TP w/o SQS 85.67 ± 0.26 88.52 ± 0.17 64.27 ± 0.39 75.22 ± 0.30 92.65 ± 0.69
Table 3: Top-1 accuracy of our method with various ablations, with 8 instances per class in the query set (except for FEMNIST-FS: 1 instance per class in the query set). TP stands for Transported Prototypes, OT denotes Optimal Transport, TBN is Transductive Batch-Normalization, OT-TT refers to the setting where Optimal Transport is applied at test time but not during episodic training, and ET means episodic training, i.e., w/o ET refers to the setting where training is performed through standard Empirical Risk Minimization. TP w/o SQS reports the model's performance in the absence of support-query shift.

Ablation study.

Transported Prototypes (TP) combines three components: Optimal Transport (OT), Transductive Batch-Normalization (TBN) and episodic training (ET). Which of these components are responsible for the observed gain? Following recommendations from Section 3.3, we ablate those components in Table 3. We observe that both OT and TBN individually improve the performance of ProtoNet for FSQS, and that the best results are obtained when the two are combined. Importantly, OT without TBN performs better than TBN without OT (except for 1-shot mIN-C), demonstrating the superiority of OT over TBN for aligning distributions in the few-samples regime. Note that the use of TaskNorm [bronskill2020tasknorm] is beyond the scope of this paper (these normalizations are implemented in FewShiftBed for future works); we encourage future work to dig into that direction and we refer the reader to the very recent work [du2021metanorm]. However, some results are mixed. There is no clear evidence that using OT at train time is better than simply applying it at test time on a ProtoNet backbone, i.e., trained episodically but without OT (except for 5-shot MC100-C). Additionally, the value of episodic training compared to standard Empirical Risk Minimization (ERM) is not obvious: simply training with ERM and then using Transported Prototypes beats adding ET on 1-shot MC100-C, 1-shot mIN-C and FEMNIST-FS, another element to add to the study of [laenen2020episodes], which questions the value of ET. Note that Table 2 reports TP results when all components are used, even if some dataset-specific choices can boost performance significantly. Understanding why and when we should use ET, or only OT at test time, is an interesting question for future work. Additionally, we compare Transported Prototypes with MAP [hu2020leveraging], which implements an OT-based approach for transductive FSL. Their approach includes a power transform to reduce the skew in the feature distribution, so for a fair comparison we also implemented it in Transported Prototypes for these experiments. Interestingly, our experiments in Table 4 show that MAP is able to handle SQS. For a fair comparison, we compare TP with MAP by using the OT module only at test time, on two backbones: ProtoNet [Snell17] and a backbone obtained by ERM. Note that since we added the power transform of [hu2020leveraging] to TP, results differ from those presented in Table 3.

Meta-CIFAR100-C miniImageNet-C FEMNIST-FS
1-shot 5-shot 1-shot 5-shot 1-shot
TP 36.17 ± 0.47 50.45 ± 0.47 45.41 ± 0.54 57.82 ± 0.48 93.60 ± 0.68
MAP 35.96 ± 0.44 49.55 ± 0.45 43.51 ± 0.47 56.10 ± 0.43 92.86 ± 0.67
TP 32.13 ± 0.45 46.19 ± 0.47 45.77 ± 0.58 59.91 ± 0.48 94.92 ± 0.56
MAP 32.38 ± 0.41 45.96 ± 0.43 43.81 ± 0.47 57.70 ± 0.43 87.15 ± 0.66
Table 4: Top-1 accuracy with 8 instances per class in the query set when applying Transported Prototypes and MAP on two different backbones: one trained with standard ERM (i.e., without episodic training) and one trained as ProtoNet [Snell17]. Transported Prototypes performs on par with or better than MAP [hu2020leveraging]. Here TP includes the power transform in the feature space.

Behaviour in absence of Support-Query Shift.

In order to evaluate the performance drop caused by Support-Query Shift compared to a setting where support and query instances are sampled from the same distribution, we test Transported Prototypes on few-shot classification tasks without SQS (TP w/o SQS in Table 3), a setup equivalent to CDFSL. Note that in both cases, the model is trained in an episodic fashion on tasks presenting a Support-Query Shift. These results show that SQS presents a significantly harder challenge than CDFSL, and that there remains considerable room for improvement.

6 Related Works

This paper makes several contributions. We first define a new problem and propose algorithms to solve it. We bring to the community the formal statement of FSQS and a testbed including datasets, a protocol and several baselines. Releasing benchmarks has always been an important driver of progress in Machine Learning, the most outstanding example being ImageNet [deng2009imagenet] for the Computer Vision community. Recently, DomainBed [gulrajani2021in] aims to settle Domain Generalization research into a more rigorous process; FewShiftBed takes inspiration from it. Meta-Dataset [triantafillou2019meta] is another example, this time specific to FSL. Concerning the novelty of FSQS, we acknowledge the very recent contribution of Du et al. [du2021metanorm], which studies the role of learnable normalization for domain generalization, in particular when support and query sets are sampled from different domains. Note that our statement is more ambitious: we evaluate algorithms on both source and target domains that were unseen during training, while in their setting the source domain has already been seen during training. We also discuss in depth, in Section 2.2, the positioning of FSQS with respect to existing learning paradigms. Following [bronskill2020tasknorm], we study the role of Batch-Normalization for SQS, which points out the role of transductivity. Our conviction is that batch-normalization is a first lever for aligning distributions [schneider2020improving, wang2020fully]. Besides, we bridge the gap between UDA and FSL using Optimal Transport (OT) [peyre2019computational]. OT has a long-standing history in UDA [courty2016optimal] and has recently been applied in a context of transductive FSL [hu2020leveraging], while our proposal (TP) provides a simple and strong baseline following the principles of OT as applied in UDA.

7 Conclusion

We release FewShiftBed, a testbed for the under-investigated and crucial problem of Few-Shot Learning when the support and query sets are sampled from related but different distributions, named FSQS. FewShiftBed includes three datasets, relevant baselines and a protocol for reproducible research. Inspired by recent progress of Optimal Transport (OT) in addressing Unsupervised Domain Adaptation, we propose a method that efficiently combines OT with the celebrated Prototypical Network [Snell17]. Following the protocol of FewShiftBed, we bring compelling experiments demonstrating the advantage of our proposal over its transductive counterparts. We also isolate the factors responsible for these improvements. Our findings suggest that the role of Batch-Normalization is ubiquitous, as described in related works [bronskill2020tasknorm, du2021metanorm], while the value of episodic training, even if promising on paper, is questionable. Moving beyond transductive algorithms, as well as understanding when meta-learning brings a clear advantage for FSQS, remains an open and exciting problem, for which FewShiftBed provides a first step.


Etienne Bennequin is funded by Sicara and ANRT (France), and Victor Bouvier is funded by Sidetrade and ANRT (France), both through a CIFRE collaboration with CentraleSupélec. This work was performed using HPC resources from the “Mésocentre” computing center of CentraleSupélec and École Normale Supérieure Paris-Saclay, supported by CNRS and Région Île-de-France.


Appendix 0.A Extended positioning

Few-Shot Classification.

Methods to solve the Few-Shot Classification problem [lake2011] usually fall into one of three categories [Chen19]: metric-based, optimization-based, and hallucination-based. Most metric-learning methods are built on the principle of Siamese Networks [Koch15], while also exploiting the meta-learning paradigm: they learn a feature extractor across training tasks [Vinyals16]. Prototypical Networks [Snell17] classify queries from their Euclidean distances to one prototype embedding per class. Relation Networks [Sung18] add another deep network on top of Prototypical Networks to replace the Euclidean distance. Optimization-based methods use another approach: learning to fine-tune. MAML [Finn17] and Reptile [nichol2018firstorder] learn a good model initialization, i.e., model parameters that can adapt to a new task (with novel classes) in a small number of gradient steps. Other methods such as Meta-LSTM [Ravi16] and Meta-Networks [Munkhdalai17] replace standard gradient descent with a meta-learned optimizer. Hallucination-based methods aim at augmenting the scarce labeled data by hallucinating feature vectors [hariharan2017low], using Generative Adversarial Networks [antoniou2017data], or through meta-learning [wang2018low]. Recent works also suggest that competitive results in Few-Shot Classification can be achieved with simpler methods based on fine-tuning [Chen19, goldblim2020unraveling].
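The "learning to fine-tune" idea can be illustrated with a tiny first-order example in the spirit of Reptile (a toy 1-D regression setup of ours, not the original algorithm's configuration): after each task's inner-loop adaptation, the meta-learned initialization is nudged toward the adapted weights, so that a few gradient steps suffice on a new task.

```python
import numpy as np

rng = np.random.default_rng(0)

def inner_loop(w, x, y, lr=0.05, steps=5):
    # A few gradient steps on the squared error of the linear model y ≈ w * x.
    for _ in range(steps):
        grad = 2.0 * np.mean((w * x - y) * x)
        w = w - lr * grad
    return w

# Reptile-style outer loop: move the initialization toward the weights
# obtained after inner-loop adaptation on each sampled task.
w_init = 0.0
for _ in range(500):
    slope = rng.uniform(1.0, 3.0)  # each task has its own ground-truth slope
    x = rng.normal(size=20)
    w_task = inner_loop(w_init, x, slope * x)
    w_init = w_init + 0.1 * (w_task - w_init)
```

After meta-training, w_init sits roughly at the centre of the task distribution (slope around 2), so adapting to any individual task takes only a handful of steps.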

Transductive Few-Shot Classification.

Some methods aim at solving few-shot classification tasks by using the query set as unlabeled data. Transductive Propagation Network [liu2018learning] meta-learns label propagation from the support set to the query set concurrently with the feature extractor. Antoniou & Storkey [antoniou2019learning] proposed to use a meta-learned critic network to further adapt a classifier on the query set in an unsupervised setting. Ren et al. [ren2018meta] extend Prototypical Networks in order to use the query set in the prototype computation. Transductive Information Maximization [boudiaf2020transductive] aims at maximizing the mutual information between the features extracted from the query set and their predicted labels. Finally, Transductive Fine-Tuning [dhillon2019baseline] augments standard fine-tuning with the classification entropy of all query instances.

Unsupervised Domain Adaptation.

UDA has a long-standing history [pan2009survey, quionero2009dataset]. The analysis of the role of representations in [ben2007analysis] has led to a wide literature based on domain-invariant representations [ganin2015unsupervised, long2015learning]. Outstanding progress has been made toward learning more transferable representations by enforcing domain invariance. The tensorial product between representations and predictions promotes conditional domain invariance [long2018conditional]; the use of weights [cao2018partial, you2019universal, bouvier2020robust, combes2020domain] has dramatically improved handling of the label-shift problem theoretically described in [zhang2019bridging]; other approaches hallucinate consistent target samples [liu2019transferable], penalize high singular values of batches of representations [chen2019transferability], or enforce the favorable inductive bias of consistency through various data augmentations in the target domain [ouali2020target]. Recent works address the problem of adaptation without source data [liang2020we, yehsofa]. The seminal work [courty2016optimal], followed by [courty2017joint, bhushan2018deepjdot], brings Optimal Transport (OT) to UDA by transporting source samples into the target domain.

Test-Time Adaptation.

Test-time Adaptation (TTA) is the subject of recent pioneering works. In [sun2020test], adaptation is performed by test-time training of representations through a self-supervision task which consists in predicting the rotation of an image. This leads to a successful adaptation when the gradient of the fine-tuning procedure is correlated with the gradient of the cross-entropy between the prediction and the label of the target sample, which is not available. Inspired by UDA methods based on domain invariance of representations, a line of works [nado2020evaluating, schneider2020improving] aims to align the means and variances of the train and test distributions of representations, which is simply done by updating the statistics of the batch-normalization layers. In a similar spirit of leveraging the batch-normalization layer for adaptation, [wang2020fully] suggests minimizing prediction entropy on a batch of test samples, as suggested in semi-supervised learning [grandvalet2005semi]. As pointed out by the authors of [wang2020fully], updating the whole network is inefficient and risks overfitting the test batch; to address this problem, they suggest updating only the batch-normalization parameters when minimizing the prediction entropy. The paradigm of Adaptive Risk Minimization (ARM) is introduced in [zhang2020adaptive]. ARM aims to adapt a classifier at test time by conditioning its prediction on the whole batch of test samples (not only one sample). The authors demonstrate that such a classifier is meta-trainable as long as the training data exhibits a group structure. Consequently, [zhang2020adaptive] is the closest work to ours, although our setting is more ambitious: we address the problem of few-shot learning, i.e., only a few samples are available per class while new classes are discovered at test time.
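The recipe of updating only batch-normalization parameters by entropy minimization can be sketched as follows (a deliberately minimal NumPy version of ours with a single linear classifier and finite-difference gradients standing in for autograd; it illustrates the principle of [wang2020fully], not their implementation):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def mean_entropy(logits):
    # Average prediction entropy over the batch: the test-time objective.
    p = softmax(logits)
    return float(-(p * np.log(p + 1e-12)).sum(axis=1).mean())

def tent_step(x, W, gamma, beta, lr=0.05, h=1e-4):
    # One entropy-minimization step that updates ONLY the BN affine
    # parameters (gamma, beta); the classifier weights W stay frozen.
    mean, var = x.mean(axis=0), x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + 1e-5)  # BN statistics from the test batch

    def loss(g, b):
        return mean_entropy((g * x_hat + b) @ W)

    # Central finite differences over the (small) set of BN parameters.
    grad_g, grad_b = np.zeros_like(gamma), np.zeros_like(beta)
    for j in range(len(gamma)):
        e_j = np.zeros_like(gamma); e_j[j] = h
        grad_g[j] = (loss(gamma + e_j, beta) - loss(gamma - e_j, beta)) / (2 * h)
        grad_b[j] = (loss(gamma, beta + e_j) - loss(gamma, beta - e_j)) / (2 * h)
    return gamma - lr * grad_g, beta - lr * grad_b
```

Because only the per-channel scale and shift move, the adaptation is cheap and far less prone to overfitting the test batch than fine-tuning the whole network.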

Few-Shot Classification under Distributional Shift.

Recent works on few-shot classification tackle the problem of distributional shift between the meta-training set and the meta-testing set. Chen et al. [Chen19] compare the performance of state-of-the-art few-shot classification methods in a cross-domain setting (meta-training on miniImageNet [Vinyals16] and meta-testing on Caltech-UCSD Birds 200 [WelinderEtal2010]). Zhao et al. propose a Domain-Adversarial Prototypical Network [zhao2020domain] in order to both align source and target domains in the feature space and maintain discriminativeness between classes. Considering the problem as a shift in the distribution of tasks (i.e., training and testing tasks are drawn from two distinct distributions), Sahoo et al. combine Prototypical Networks with adversarial domain adaptation at the task level [sahoo2019meta]. While these works address the key issue of distributional shift between meta-training and meta-testing, they assume that for each task, the support set and query set are always drawn from the same distribution. We find that this assumption rarely holds in practice. In this work we consider a distributional shift both between meta-training and meta-testing and between the support and query sets.

Appendix 0.B All experimental results

In this section we present the extended results of our experiments. Prototypical Networks, Matching Networks and Transductive Propagation Networks are each evaluated in 10 distinct variants:

  • Original algorithms: episodic training, with Conventional Batch-Normalization (CBN) and no Optimal Transport (Vanilla);

  • Episodic training and CBN, with Optimal Transport applied at test time (OT-TT);

  • Episodic training and CBN, with Optimal Transport integrated into the algorithm both during training and testing (OT);

  • Episodic training, with Transductive Batch-Normalization (TBN) and no Optimal Transport (Vanilla);

  • Episodic training and TBN, with OT-TT;

  • Episodic training and TBN, with OT;

  • Standard Empirical Risk Minimization (ERM) instead of episodic training, with CBN and no Optimal Transport (Vanilla);

  • ERM with CBN and OT;

  • ERM with TBN and no Optimal Transport (Vanilla);

  • ERM with TBN and OT.

Transductive Fine-Tuning (FTNet) is not compatible with episodic training, and the integration of Optimal Transport into this algorithm is non-trivial. Therefore, we only evaluate FTNet with ERM and without OT.

Every result presented in the following tables is the average over three runs with three random seeds (1, 2 and 3). For clarity, we do not report the 95% confidence interval for each result. Keep in mind that this interval is different for each result, but we found that it is always greater than 0.2% and smaller than 0.8%.

Details of the experiments and instructions to reproduce them are available in the code.

Meta-CIFAR100-C 1-shot 8-target
Episodic training (CBN) | Episodic training (TBN) | Standard ERM (CBN) | Standard ERM (TBN)
Vanilla w. OT-TT w. OT | Vanilla w. OT-TT w. OT | Vanilla w. OT | Vanilla w. OT
ProtoNet 30.02 32.11 33.74 32.47 32.81 34.00 29.10 35.48 29.79 35.40
MatchingNet 30.71 32.85 34.48 32.97 32.78 35.11 33.50 36.13 33.67 35.87
PropNet 30.26 28.70 26.87 34.15 29.48 27.68 23.33 31.08 22.55 31.20
FTNet 28.91 28.75
Table 5: Ablation for Meta-CIFAR100-C 1-shot 8-target.
Meta-CIFAR100-C 1-shot 16-target
Episodic training (CBN) | Episodic training (TBN) | Standard ERM (CBN) | Standard ERM (TBN)
Vanilla w. OT-TT w. OT | Vanilla w. OT-TT w. OT | Vanilla w. OT | Vanilla w. OT
ProtoNet 29.98 32.24 35.63 32.52 31.72 36.20 29.02 35.89 29.61 35.94
MatchingNet 31.10 30.94 35.53 33.08 33.28 36.36 33.49 36.61 33.64 36.54
PropNet 30.82 32.39 31.15 34.83 33.53 31.33 26.81 33.90 27.92 34.10
FTNet 29.01 28.86
Table 6: Ablation for Meta-CIFAR100-C 1-shot 16-target.
Meta-CIFAR100-C 5-shot 8-target
Episodic training (CBN) | Episodic training (TBN) | Standard ERM (CBN) | Standard ERM (TBN)
Vanilla w. OT-TT w. OT | Vanilla w. OT-TT w. OT | Vanilla w. OT | Vanilla w. OT
ProtoNet 42.77 47.54 48.37 48.00 48.62 49.71 44.89 48.61 46.59 48.66
MatchingNet 41.15 43.90 44.55 45.05 44.86 45.78 43.00 45.35 43.51 45.10
PropNet 39.13 40.60 25.68 47.39 40.47 27.29 29.32 39.82 29.50 29.82
FTNet 37.28 37.40
Table 7: Ablation of Meta-CIFAR100-C 5-shot 8-target
Meta-CIFAR100-C 5-shot 16-target
Episodic training (CBN) | Episodic training (TBN) | Standard ERM (CBN) | Standard ERM (TBN)
Vanilla w. OT-TT w. OT | Vanilla w. OT-TT w. OT | Vanilla w. OT | Vanilla w. OT
ProtoNet 42.07 48.26 48.25 46.49 48.71 49.94 44.67 48.61 46.48 48.89
MatchingNet 41.74 44.51 45.71 44.91 44.71 47.37 42.97 46.06 46.22 46.37
PropNet 38.73 39.25 37.22 43.91 40.62 40.02 33.06 40.03 33.93 40.03
FTNet 37.51 37.66
Table 8: Ablation of Meta-CIFAR100-C 5-shot 16-target
FEMNIST-FewShot 1-shot 1-target
Episodic training (CBN) | Episodic training (TBN) | Standard ERM (CBN) | Standard ERM (TBN)
Vanilla w. OT-TT w. OT | Vanilla w. OT-TT w. OT | Vanilla w. OT | Vanilla w. OT
ProtoNet 84.31 94.00 92.31 90.36 94.92 93.63 80.20 94.30 86.22 94.22
MatchingNet 84.25 93.66 92.73 91.05 95.37 93.62 85.04 94.34 87.19 94.26
PropNet 31.30 40.60 79.30 86.42 93.08 87.52 45.36 73.64 47.34 79.50
FTNet 86.13 85.92
Table 9: Ablation of FEMNIST-FewShot 1-shot 1-target.
Meta-CIFAR100-C miniImageNet-C FEMNIST-FS
1-shot 5-shot 1-shot 5-shot 1-shot
MAP 36.58 49.37 43.38 56.25 92.94
TP (ours) 36.51 50.60 45.38 61.46 93.63
Table 10: Top-1 accuracy of MAP [hu2020leveraging] compared to Transported Prototypes (ours). Both methods incorporate Optimal Transport into Few-Shot Learning. MAP [hu2020leveraging] is originally designed for standard transductive FSL. Interestingly, MAP and TP perform quite similarly, demonstrating that OT is a powerful tool for addressing FSQS. Note that MAP leverages a power transform that we also plug into TP for comparison, resulting in a boost of performance. Understanding which learners operate best with Optimal Transport is an exciting question. In particular, by proposing TP, we have shown that following the principles of OT as applied in UDA results in a strong, interpretable and theoretically motivated method.

Appendix 0.C Training details

Entropic regularization for Optimal Transport was proposed in [cuturi2013sinkhorn] and makes OT easier to solve. The regularized problem is defined as the minimum over couplings P in Π(a, b) of ⟨P, C⟩ + ε H(P), where Π(a, b) is the set of couplings with marginals a and b, C is the ground cost, and H(P) = Σ_ij P_ij log P_ij is the negative entropy. It promotes smoother transportation plans while allowing to derive a computationally efficient algorithm based on Sinkhorn-Knopp's scaling matrix approach [knight2008sinkhorn]. In our experiments, we keep ε fixed (the exact value is given in the code), but it is possible to tune it, or even to meta-learn it.
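For reference, the scaling iterations can be written in a few lines (a generic NumPy sketch with our own variable names, not the exact code of the testbed):

```python
import numpy as np

def sinkhorn_knopp(a, b, C, eps=0.1, n_iters=1000):
    # Entropy-regularized OT between histograms a (rows) and b (columns)
    # for a ground cost matrix C: alternately rescale the Gibbs kernel
    # so that the plan's marginals match a and b.
    K = np.exp(-C / eps)                 # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)                # scale columns to match marginal b
        u = a / (K @ v)                  # scale rows to match marginal a
    return u[:, None] * K * v[None, :]   # transport plan
```

Smaller ε yields plans closer to the unregularized optimum but slows convergence of the iterations, which is why ε is a natural candidate for tuning or meta-learning.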