1 Introduction
Humans can learn to identify new categories from few examples, even from a single one (Carey and Bartlett, 1978). Few-shot learning has recently attracted significant attention (Vinyals et al., 2016; Snell et al., 2017; Sung et al., 2018; Santoro et al., 2016; Munkhdalai et al., 2018; Mishra et al., 2018), as it aims to produce models that can generalize from small amounts of labeled data. In the few-shot setting, one aims to learn a model that extracts information from a set of support examples (sample set) to predict the labels of instances from a query set. Recently, this problem has been reframed into the meta-learning framework (Ravi and Larochelle, 2016), i.e. the model is trained so that, given a sample set or task, it produces a classifier for that specific task. Thus, the model is exposed to different tasks (or episodes) during the training phase, and it is evaluated on a non-overlapping set of new tasks (Vinyals et al., 2016).

Two recent approaches have attracted significant attention in the few-shot learning domain: Matching Networks (Vinyals et al., 2016) and Prototypical Networks (Snell et al., 2017). In both approaches, the sample set and the query set are embedded with a neural network, and nearest neighbor classification is used given a metric in the embedded space. Since then, the problem of learning the most suitable metric for few-shot learning has been of interest to the field (Vinyals et al., 2016; Snell et al., 2017; Sung et al., 2018; Munkhdalai et al., 2018; Mishra et al., 2018). Learning a metric space in the context of few-shot learning generally implies identifying a suitable similarity measure (e.g. cosine or Euclidean), a feature extractor mapping raw inputs onto similarity space (e.g. convolutional stack for images or LSTM stack for text), a cost function to drive the parameter updates, and a training scheme (often episodic). Although the individual components in this list have been explored, the relationships between them have not received considerable attention.

In the current work we aim to close this gap. We show that taking into account the interaction between the identified components leads to significant improvements in few-shot generalization. In particular, we show that a non-trivial interaction between the similarity metric and the cost function can be exploited to improve the performance of a given similarity metric via scaling. Using this mechanism we close the gap of more than 10% in performance between the cosine similarity and the Euclidean distance reported by Snell et al. (2017). Even more importantly, we extend the very notion of the metric space by making it task dependent via conditioning the feature extractor on the specific task. However, learning such a space is in general more challenging than learning a static one. Hence, we find a solution in exploiting the interaction between the conditioned feature extractor and the training procedure based on auxiliary co-training on a simpler task. Our proposed few-shot learning architecture based on a task-dependent scaled metric achieves superior performance on two challenging few-shot image classification datasets. It shows up to 8.5% absolute accuracy improvement over the baseline (Snell et al., 2017), and 4.8% over the state of the art (Munkhdalai et al., 2018) on the 5-shot, 5-way miniImagenet classification task, reaching 76.7% accuracy, which is the best reported accuracy on this dataset.

1.1 Background
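We consider the episodic $K$-shot, $C$-way classification scenario. In this scenario, a learning algorithm is provided with a sample set $\mathcal{S} = \{(\mathbf{x}_i, y_i)\}_{i=1}^{KC}$ consisting of $K$ examples for each of $C$ classes and a query set $\mathcal{Q} = \{(\mathbf{x}_i, y_i)\}_{i=1}^{q}$ for a task to be solved within a given episode. The sample set provides the task information via observations $\mathbf{x}_i \in \mathbb{R}^{D_x}$ and their respective class labels $y_i \in \{1, \ldots, C\}$. Given the information in the sample set $\mathcal{S}$, the learning algorithm is able to classify individual samples from the query set $\mathcal{Q}$. Next, we define a similarity measure $d : \mathbb{R}^{2 D_z} \to \mathbb{R}$. Note that $d$ does not have to satisfy the classical metric properties (non-negativity, symmetry, subadditivity) to be useful in the context of few-shot learning. The dimensionality of the metric input, $D_z$, will most naturally be related to the size of the embedding created by a (deep) feature extractor $f_\phi : \mathbb{R}^{D_x} \to \mathbb{R}^{D_z}$, parameterized by $\phi$, mapping $\mathbf{x}$ to $\mathbf{z}$. Here $\phi \in \mathbb{R}^{D_\phi}$ is a list of parameters defining $f_\phi$, e.g. a list of weights in a neural network. The set of representations $(f_\phi(\mathbf{x}_i), y_i)$, $\forall (\mathbf{x}_i, y_i) \in \mathcal{S}$, can directly be used to solve the few-shot learning classification problem by association. For example, Matching Networks (Vinyals et al., 2016) use a sample-wise attention mechanism to perform kernel label regression. Instead, Snell et al. (2017) defined a feature representation $\mathbf{c}_k$ for each class $k$ as the mean over embeddings belonging to $\mathcal{S}_k = \{(\mathbf{x}_i, y_i) \in \mathcal{S} : y_i = k\}$: $\mathbf{c}_k = \frac{1}{K} \sum_{\mathbf{x}_i \in \mathcal{S}_k} f_\phi(\mathbf{x}_i)$. To learn $\phi$, they minimize $-\log p_\phi(y = k \mid \mathbf{x})$ using the softmax over prototypes $\mathbf{c}_k$ to define the likelihood: $p_\phi(y = k \mid \mathbf{x}) = \mathrm{softmax}(-d(f_\phi(\mathbf{x}), \mathbf{c}_k))$.

1.2 Summary of contributions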
Metric Scaling: To our knowledge, this is the first study to (i) propose metric scaling to improve the performance of few-shot algorithms, (ii) mathematically analyze its effects on objective function updates and (iii) empirically demonstrate its positive effects on few-shot performance.
Task Conditioning: We use a task encoding network to extract a task representation based on the task's sample set. This is used to influence the behavior of the feature extractor through FiLM (Perez et al., 2018).
Auxiliary task co-training: We show that co-training the feature extractor on a conventional supervised classification task reduces training complexity and provides better generalization.
1.3 Related work
Three main approaches for solving the few-shot classification problem can be identified in the literature. The first one, which is used in this work, is the meta-learning approach, i.e. learning a model that, given a task (set of labeled data), produces a classifier that generalizes across all tasks (Thrun, 1998; Schmidhuber et al., 1997). This is the case of Matching Networks (Vinyals et al., 2016), which optionally use a Recurrent Neural Network (RNN) to accumulate information about a given task. In MAML (Finn et al., 2017), the parameters of an arbitrary learner model are optimized so that they can be quickly adapted to a particular task. In “Optimization as a model” (Ravi and Larochelle, 2016), a learner model is adapted to a new episodic task by a recurrent meta-learner producing efficient parameter updates. A more general approach was proposed by Santoro et al. (2016), where the meta-learner is trained to represent entries from a sample set in an external memory. Similarly, adaResNet (Munkhdalai et al., 2018) uses memory and the sample set to produce shift coefficients on the neuron activations of the query set classifier. Many recent approaches focus on learning a metric on the episodic feature space. Prototypical Networks (Snell et al., 2017) use a feed-forward neural network to embed the task examples and perform nearest neighbor classification with the class centroids. The relation network approach by Sung et al. (2018) introduces a separate learnable similarity metric. SNAIL (Mishra et al., 2018) uses an explicit attention mechanism applicable both to supervised and to sequence-based reinforcement learning tasks. It has also been shown that these approaches benefit from leveraging unlabeled and simulated data (Ren et al., 2018; Wang et al., 2018).

A second approach aims to maximize the distance between examples from different classes (Koch et al., 2015). Similarly, in (Hadsell et al., 2006), a contrastive loss function is used to learn to project data onto a manifold that is invariant to deformations in the input space. In the same vein, in (Fink, 2005; Schroff et al., 2015; Taigman et al., 2015), triplet loss is used for learning a representation for few-shot learning. The attentive recurrent comparators (Shyam et al., 2017) go beyond classical siamese approaches and use a recurrent architecture to learn to perform pairwise comparisons and predict whether the compared examples belong to the same class.

The third approach relies on Bayesian modeling of the prior distribution of the different categories, as in Li et al. (2006) and Bauer et al. (2017), or on hierarchical Bayesian modeling, as in Lake et al. (2013), Edwards and Storkey (2016) and Lacoste et al. (2017).
As for task conditioning, Dumoulin et al. (2017) and Perez et al. (2017, 2018) proposed conditional batch normalization for style transfer and visual reasoning. In contrast, we modify the conditioning scheme to adapt it to few-shot learning, introducing priors and auxiliary co-training. In the few-shot learning context, task conditioning ideas can be traced back to (Vinyals et al., 2016), although in an implicit form, as there is no notion of task embedding. In our work, we explicitly introduce a task representation (see Fig. 1) computed as the mean of the task class centroids (task prototypes). This is much simpler than the individual sample-level LSTM/attention models in (Vinyals et al., 2016). Conditioning in (Vinyals et al., 2016) is applied as a post-processing of the output of a fixed feature extractor. We propose to condition the feature extractor by predicting its own batch normalization parameters, thus making the feature extractor behaviour task-dynamic without cumbersome fine-tuning on the support set. In order to train the task-conditioned architecture we use multitask training with a standard 64-way classification task. Even though auxiliary co-training is beneficial for learning in general, “little is known on when multitask learning works and whether there are data characteristics that help to determine its success” (Plank and Alonso, 2017). We show that combining task conditioning and auxiliary co-training is beneficial in the context of few-shot learning.

The scaling and temperature adjustment in the softmax was discussed by Hinton et al. (2015) in the context of model distillation. We propose to use it in the context of the few-shot learning scenario and provide novel theoretical and empirical results quantifying the effects of the scaling parameter.
The rest of the paper is organized as follows. Section 2 describes our contributions in detail. Section 3 highlights the importance of each contribution via an ablation study. The study is performed over two different benchmarks in the regime of 1-shot, 5-shot and 10-shot learning to verify whether the conclusions hold across different setups. Finally, Section 4 concludes the paper and outlines future research directions.
2 Model Description
2.1 Metric Scaling
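Snell et al. (2017), using the approach described in detail in Section 1.1, found that the Euclidean distance outperformed the cosine distance used in Vinyals et al. (2016). We hypothesize that the improvement could be directly attributed to the interaction of the different scaling of the metrics with the softmax. Moreover, the dimensionality of the output is known to have a direct impact on the output scale even for the Euclidean distance (Vaswani et al., 2017). Hence, we propose to scale the distance metric by a learnable temperature, $\alpha$, $p_{\phi,\alpha}(y = k \mid \mathbf{x}) = \mathrm{softmax}(-\alpha\, d(\mathbf{z}, \mathbf{c}_k))$, to enable the model to learn the best regime for each similarity metric, thus improving the performance of all metrics. To further understand the role of $\alpha$, we analyze the class-wise cross-entropy loss function, $J_k(\phi, \alpha)$ (note that the total loss is simply the sum over classes, $J(\phi, \alpha) = \sum_k J_k(\phi, \alpha)$):

$$J_k(\phi, \alpha) = \sum_{\mathbf{x}_i \in \mathcal{Q}_k} \Big[ \alpha\, d(f_\phi(\mathbf{x}_i), \mathbf{c}_k) + \log \sum_j \exp\big(-\alpha\, d(f_\phi(\mathbf{x}_i), \mathbf{c}_j)\big) \Big], \tag{1}$$

where $\mathcal{Q}_k = \{(\mathbf{x}_i, y_i) \in \mathcal{Q} : y_i = k\}$ is the query set corresponding to the class $k$. Its gradient, which is used to update the parameters $\phi$, is given by the following expression:

$$\frac{\partial}{\partial \phi} J_k(\phi, \alpha) = \alpha \sum_{\mathbf{x}_i \in \mathcal{Q}_k} \Big[ \frac{\partial}{\partial \phi} d(f_\phi(\mathbf{x}_i), \mathbf{c}_k) - \frac{\sum_j \exp\big(-\alpha\, d(f_\phi(\mathbf{x}_i), \mathbf{c}_j)\big) \frac{\partial}{\partial \phi} d(f_\phi(\mathbf{x}_i), \mathbf{c}_j)}{\sum_j \exp\big(-\alpha\, d(f_\phi(\mathbf{x}_i), \mathbf{c}_j)\big)} \Big]. \tag{2}$$

At first glance, the effect of $\alpha$ on the expression of the derivative is twofold: (i) an overall scaling, and (ii) regulating the sharpness of the weighting in the second term inside the brackets on the RHS. Below we explore the behavior of the normalized gradient, $\frac{1}{\alpha} \frac{\partial}{\partial \phi} J_k(\phi, \alpha)$, in the limits $\alpha \to 0$ and $\alpha \to \infty$ (the effect of the $\alpha$-related overall gradient scaling is trivial).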
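Lemma 1 (Metric scaling).
If the following assumptions hold:

$\mathcal{A}_1$: $d(f_\phi(\mathbf{x}), \mathbf{c}_k) \neq d(f_\phi(\mathbf{x}), \mathbf{c}_{k'})$, $\forall k \neq k'$, $\forall \mathbf{x} \in \mathcal{Q}$;
$\mathcal{A}_2$: $\big\| \frac{\partial}{\partial \phi} d(f_\phi(\mathbf{x}), \mathbf{c}) \big\| < \infty$, $\forall \mathbf{x}, \mathbf{c}, \phi$;

then it is true that:

$$\lim_{\alpha \to 0} \frac{1}{\alpha} \frac{\partial}{\partial \phi} J_k(\phi, \alpha) = \sum_{\mathbf{x}_i \in \mathcal{Q}_k} \Big[ \frac{C-1}{C} \frac{\partial}{\partial \phi} d(f_\phi(\mathbf{x}_i), \mathbf{c}_k) - \frac{1}{C} \sum_{j \neq k} \frac{\partial}{\partial \phi} d(f_\phi(\mathbf{x}_i), \mathbf{c}_j) \Big], \tag{3}$$

$$\lim_{\alpha \to \infty} \frac{1}{\alpha} \frac{\partial}{\partial \phi} J_k(\phi, \alpha) = \sum_{\mathbf{x}_i \in \mathcal{Q}_k} \Big[ \frac{\partial}{\partial \phi} d(f_\phi(\mathbf{x}_i), \mathbf{c}_k) - \frac{\partial}{\partial \phi} d(f_\phi(\mathbf{x}_i), \mathbf{c}_{j_i^*}) \Big], \tag{4}$$

where $j_i^* = \arg\min_j d(f_\phi(\mathbf{x}_i), \mathbf{c}_j)$.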
Proof.
Please refer to Appendix A. ∎
From Eq. (3), it is clear that for small $\alpha$ values, the first term minimizes the embedding distance between query samples and their corresponding prototypes. The second term maximizes the embedding distance between the samples and the prototypes of the non-belonging categories. For large $\alpha$ values (Eq. (4)), the first term is the same as in Eq. (3), while the second term maximizes the distance of the sample to the closest wrongly assigned prototype (if any). If $j_i^* = k$ (no error), the derivative contribution of the point is zero. This is equivalent to learning only from the hardest examples, those resulting in association errors. Thus, the two different regimes of $\alpha$ favor either minimizing the overlap of the sample distributions or correcting cluster assignments sample-wise.
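To make the two regimes concrete, the short Python sketch below evaluates, for a single query point, the coefficients that multiply each prototype's distance gradient inside the brackets of Eq. (2). The function names and the toy distance values are illustrative assumptions, not part of the original experiments.

```python
import math

def softmax_weights(distances, alpha):
    """Weights softmax(-alpha * d_j) over prototypes, computed stably."""
    logits = [-alpha * d for d in distances]
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def gradient_coefficients(distances, true_class, alpha):
    """Coefficient on each prototype's distance gradient for one query point:
    +1 for the true class, minus the softmax weight of every class."""
    weights = softmax_weights(distances, alpha)
    return [(1.0 if j == true_class else 0.0) - w
            for j, w in enumerate(weights)]

# Toy distances from one query to three prototypes (hypothetical values).
distances = [1.0, 2.0, 3.0]

small = gradient_coefficients(distances, true_class=0, alpha=1e-6)
large_correct = gradient_coefficients(distances, true_class=0, alpha=100.0)
large_wrong = gradient_coefficients(distances, true_class=1, alpha=100.0)

print(small)          # close to [2/3, -1/3, -1/3]: every prototype contributes
print(large_correct)  # close to [0, 0, 0]: a correctly classified point is ignored
print(large_wrong)    # close to [-1, 1, 0]: true prototype versus nearest wrong one
```

In this toy case the small-scale coefficients pull the query toward its own prototype and push it away from all others, while at a large scale only a misclassified query produces a non-zero update.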
The large $\alpha$ regime is more directly related to resolving the few-shot classification errors. At the same time, the update strategy generated in this regime has a drawback. As the optimization proceeds and the classification accuracy increases, the number of incorrectly classified samples decreases on average, which reduces the average effective batch size (more samples generate zero derivatives). Therefore, our hypothesis is that there is an optimal value of the scaling parameter $\alpha$ for a given combination of dataset, metric and task. Section 3.4 empirically demonstrates that the optimal value of $\alpha$ indeed exists and that it can be selected, e.g., by cross-validation on a validation set.
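A minimal Python sketch of the scaled prototype classifier discussed in this section is given below. It builds class prototypes as mean embeddings and evaluates the class probabilities as a softmax over negative scaled distances for a few values of the scale. The helper names, the toy two-dimensional embeddings and the choice of squared Euclidean distance are assumptions made for illustration; in the paper the embeddings come from a learned feature extractor.

```python
import math

def prototypes(embeddings, labels):
    """Mean embedding per class label."""
    sums, counts = {}, {}
    for z, y in zip(embeddings, labels):
        acc = sums.setdefault(y, [0.0] * len(z))
        for i, v in enumerate(z):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

def sq_euclidean(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))

def scaled_probs(z, protos, alpha, dist=sq_euclidean):
    """Class probabilities: softmax over classes of -alpha * d(z, c_k)."""
    logits = {k: -alpha * dist(z, c) for k, c in protos.items()}
    top = max(logits.values())
    exps = {k: math.exp(v - top) for k, v in logits.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}

def query_loss(z, y, protos, alpha, dist=sq_euclidean):
    """Cross-entropy of one query embedding z with true class y."""
    return -math.log(scaled_probs(z, protos, alpha, dist)[y])

# Toy 2-way, 2-shot sample set (hypothetical embeddings).
support = [[0.0, 0.0], [0.0, 2.0], [4.0, 0.0], [4.0, 2.0]]
support_labels = [0, 0, 1, 1]
protos = prototypes(support, support_labels)

query = [1.0, 1.0]
for alpha in (0.1, 1.0, 10.0):
    print(alpha, scaled_probs(query, protos, alpha), query_loss(query, 0, protos, alpha))
```

The same embeddings yield soft or nearly hard assignments depending only on the scale, which is the interaction between metric and loss that the analysis above examines.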
2.2 Task conditioning
Up until now we assumed the feature extractor $f_\phi(\cdot)$ to be task-independent. A dynamic task-conditioned feature extractor should be better suited for finding correct associations between given sample set class representations and query samples. This is implicitly done by Vinyals et al. (2016) with a bidirectional LSTM as a post-processing of a fixed feature extractor. In contrast, we explicitly define a dynamic feature extractor $f_\phi(\mathbf{x}, \Gamma)$, where $\Gamma$ is the set of parameters predicted from a task representation such that the performance of $f_\phi(\mathbf{x}, \Gamma)$ is optimized given the task sample set $\mathcal{S}$. This is related to the FiLM conditioning layer (Perez et al., 2018) and conditional batch normalization (Dumoulin et al., 2017; Perez et al., 2017) of the form $h_{\ell+1} = \gamma \odot h_\ell + \beta$, where $\gamma$ and $\beta$ are scaling and shift vectors applied to the layer $h_\ell$. Concretely, we propose to use the mean of the class prototypes as the task representation, $\bar{\mathbf{c}} = \frac{1}{C} \sum_k \mathbf{c}_k$, encode it with a task embedding network (TEN), and predict layer-level element-wise scale and shift vectors $\gamma, \beta$ for each convolutional layer in the feature extractor (see Figures 1 and 2 in the Supplementary Materials, Section S1). The task representation defined as the mean of task class centroids (i) reduces the dimensionality of the TEN input and (ii) replaces expensive RNN/CNN/attention modeling. At the same time, it is an effective way to cluster tasks: tasks having a larger number of similar classes in common will tend to cluster closer in the task representation space.

Our implementation of the TEN (see Supplementary Materials, Section S1 for more details) uses two separate fully connected residual networks to generate the vectors $\gamma, \beta$. Following the terminology in (Perez et al., 2017), the $\gamma$ parameter is learned in the delta regime, i.e. predicting the deviation from unity. The most critical component in being able to successfully train the TEN was the addition of the scalar $L_2$-penalized post-multipliers $\gamma_0$ and $\beta_0$. They limit the effect of $\gamma$ (and $\beta$) by encoding a prior belief that all components of $\gamma$ (and $\beta$) should be simultaneously close to zero for a given layer unless task conditioning provides a significant information gain for this layer. Mathematically, this can be expressed as $\beta = \beta_0\, g_\theta(\bar{\mathbf{c}})$ and $\gamma = \gamma_0\, h_\varphi(\bar{\mathbf{c}}) + 1$, where $g_\theta$ and $h_\varphi$ are predictors of $\beta$ and $\gamma$.
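The conditioning described above can be sketched in a few lines of Python. The sketch computes the task representation as the mean of the class prototypes and applies an element-wise scale and shift in which the scale is predicted as a deviation from one and both terms are gated by scalar post-multipliers. The raw predictor outputs passed in here stand in for the two TEN networks and are hypothetical values; the function names are likewise assumptions for illustration.

```python
def task_representation(protos):
    """Mean of the class prototypes (one list of floats per class)."""
    n = len(protos)
    return [sum(col) / n for col in zip(*protos)]

def film_params(scale_pred, shift_pred, gamma0, beta0):
    """Scale in the delta regime (deviation from one), both gated by scalars."""
    gamma = [gamma0 * s + 1.0 for s in scale_pred]
    beta = [beta0 * b for b in shift_pred]
    return gamma, beta

def film(features, gamma, beta):
    """Element-wise scale and shift of one layer's activations."""
    return [g * h + b for g, h, b in zip(gamma, features, beta)]

protos = [[1.0, 3.0], [3.0, 5.0]]
task_vec = task_representation(protos)
print(task_vec)  # [2.0, 4.0]

# Hypothetical TEN outputs for one layer with two channels.
scale_pred, shift_pred = [0.5, -0.5], [1.0, 2.0]
features = [10.0, 20.0]

# With zero post-multipliers the layer is left unchanged.
gamma, beta = film_params(scale_pred, shift_pred, gamma0=0.0, beta0=0.0)
print(film(features, gamma, beta))  # [10.0, 20.0]

# Non-zero post-multipliers let the task modulate the features.
gamma, beta = film_params(scale_pred, shift_pred, gamma0=0.2, beta0=0.1)
print(film(features, gamma, beta))
```

Penalizing the two scalars pulls each layer toward the unconditioned behaviour shown in the first call, so a layer is modulated only when that pays off in the loss.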
2.3 Architecture
The overall proposed few-shot classification architecture is depicted in Fig. 1 (see Supplementary Materials, Section S1 for more details). We employ ResNet-12 (He et al., 2016) as the backbone feature extractor. It has 4 blocks of depth 3 with 3x3 kernels and shortcut connections. A 2x2 max-pool is applied at the end of each block. Convolutional layer depth starts with 64 filters and is doubled after every max-pool. Note that this architecture is similar in spirit to the architectures used in (Bauer et al., 2017) and (Munkhdalai et al., 2018), but we do not use any projection layers before or after the main backbone ResNet. On the first pass over the sample set, the TEN predicts the values of the $\gamma$ and $\beta$ parameters for each convolutional layer in the feature extractor from the task representation. Next, the sample set and the query set are processed by the feature extractor conditioned with the values of $\gamma$ and $\beta$ just generated. Both outputs are fed into a similarity metric to find an association between class prototypes and query instances. The output of the similarity metric is scaled by the scalar $\alpha$ and is fed into a softmax layer.
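Table 1: Classification accuracy (%) on the miniImagenet 5-way task.

| Method | 1-shot | 5-shot | 10-shot |
|---|---|---|---|
| Meta Nets (Ravi and Larochelle, 2016) | 43.4 | 60.6 | - |
| Matching Networks (Vinyals et al., 2016) | 46.6 | 60.0 | - |
| MAML (Finn et al., 2017) | 48.7 | 63.1 | - |
| Proto Nets (Snell et al., 2017) | 49.4 | 68.2 | - |
| Relation Net (Sung et al., 2018) | 50.4 | 65.3 | - |
| SNAIL (Mishra et al., 2018) | 55.7 | 68.9 | - |
| Discriminative k-shot (Bauer et al., 2017) | 56.3 | 73.9 | 78.5 |
| adaResNet (Munkhdalai et al., 2018) | 56.9 | 71.9 | - |
| Ours | 58.5 | 76.7 | 80.8 |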
2.4 Auxiliary task cotraining
The TEN (Section 2.2) introduces additional complexity into the architecture via task conditioning layers inserted after the convolutional and batch norm blocks. We empirically observed that simultaneously optimizing the convolutional filters and the TEN is overly challenging. We solved the problem by auxiliary co-training with an additional logit head (the normal 64-way classification in the miniImagenet case). The auxiliary task is sampled with a probability that is annealed over episodes. We annealed it using an exponential decay schedule of the form $0.9^{\lfloor 20 t / T \rfloor}$, where $T$ is the total number of training episodes and $t$ is the episode index. The initial auxiliary task selection probability was cross-validated to be 0.9 and the number of decay steps was chosen to be 20. We observed significant positive effects from the auxiliary task co-training (please refer to Section 3.4). The same positive effects were not observed with simple pre-training of the feature extractor. We attribute this to the regularization effects achieved via back-propagating auxiliary task gradients together with those of the main task.

It is of interest to note that few-shot co-training with an auxiliary classification task is related to curriculum learning (Santoro et al., 2016). The auxiliary classification problem could be considered a part of a simpler curriculum that helps the learner acquire the minimal skill level necessary before tackling harder few-shot classification tasks. Being effective at feature extraction (i.e. at task representation) forms a “prerequisite” for being effective at reconditioning features based on the representation of a given task.
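The annealed task selection can be sketched as follows in Python. The sketch implements a staircase exponential decay of the auxiliary task probability over training and uses it to choose between the two tasks at each episode. The base and the number of decay steps are exposed as parameters; their defaults, the function names and the toy episode count are choices made for this sketch rather than a description of the authors' code.

```python
import random

def aux_task_probability(t, total_episodes, base=0.9, decay_steps=20):
    """Staircase decay: base ** floor(decay_steps * t / total_episodes)."""
    return base ** ((decay_steps * t) // total_episodes)

def choose_task(t, total_episodes, rng):
    """Pick the auxiliary task with the annealed probability, else few-shot."""
    if rng.random() < aux_task_probability(t, total_episodes):
        return "auxiliary"
    return "few-shot"

total = 1000  # hypothetical number of training episodes
rng = random.Random(0)
for t in (0, 50, 500, 999):
    print(t, round(aux_task_probability(t, total), 4), choose_task(t, total, rng))
```

Early episodes are dominated by the simpler classification task, and the few-shot task takes over gradually as the probability decays.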
3 Experimental Results
Table 1 presents our key result in the context of the existing state of the art. The first five rows show approaches that use the same feature extractor as (Vinyals et al., 2016), i.e. four stacked convolutional layers of 64 filters (32 in Ravi and Larochelle (2016) and Finn et al. (2017) to avoid overfitting). In the following rows we include models like the one we propose, which is based on ResNet (He et al., 2016). Concretely, SNAIL (Mishra et al., 2018), adaResNet (Munkhdalai et al., 2018), and our architecture use four residual blocks of three stacked convolutional layers, each block followed by max pooling. In contrast, the feature extractor proposed in Bauer et al. (2017) is based on a ResNet-34 architecture with a reduced number of features.
As can be seen, the proposed algorithm significantly improves over the existing state-of-the-art results on the miniImagenet dataset. In the rest of the section we address the following research questions: (i) can metric scaling improve few-shot classification results? (Sections 3.2 and 3.4), (ii) what is the contribution of each component of our proposed architecture? (Section 3.4), (iii) can task conditioning improve few-shot classification results, and how important is it at different feature extractor depths? (Sections 3.3 and 3.4), and (iv) can auxiliary classification task co-training improve accuracy on the few-shot classification task? (Section 3.4).
3.1 Experimental setup and datasets
The details of the experimental and training setup are provided in Supplementary Materials, Section S3. Note that we focused on miniImagenet (Vinyals et al., 2016) and Few-shot CIFAR100 (introduced below) instead of Omniglot (Lake et al., 2015; Vinyals et al., 2016; Snell et al., 2017), as the former are more challenging and their error rate is more sensitive to model improvements.
miniImagenet. The miniImagenet dataset was proposed by Vinyals et al. (2016). It has 100 classes, with 600 images per class. Each task is generated by sampling 5 classes uniformly and 5 training samples per class; the remaining images from the 5 classes are used as query images to compute accuracy. To perform meta-validation and meta-test on unseen tasks (and classes), we isolate 16 and 20 classes from the original set of 100, leaving 64 classes for the training tasks. We use exactly the same train/validation/test split as the one suggested by Ravi and Larochelle (2016).
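The task generation procedure described above can be sketched in Python as follows. The sketch samples classes uniformly, takes a fixed number of support images per class and uses the remaining images of those classes as queries. The function name, the dictionary layout of the data and the toy image identifiers are assumptions made for illustration.

```python
import random

def sample_episode(images_by_class, n_way=5, k_shot=5, rng=None):
    """Return (support, query) lists of (image, episode_label) pairs."""
    rng = rng or random.Random()
    classes = rng.sample(sorted(images_by_class), n_way)
    support, query = [], []
    for label, cls in enumerate(classes):
        items = list(images_by_class[cls])
        rng.shuffle(items)
        support.extend((x, label) for x in items[:k_shot])
        query.extend((x, label) for x in items[k_shot:])
    return support, query

# Toy stand-in for the training split: 64 classes with 600 image ids each.
toy_split = {c: [f"class{c}_img{i}" for i in range(600)] for c in range(64)}
support, query = sample_episode(toy_split, rng=random.Random(0))
print(len(support), len(query))  # 25 2975
```

Labels are re-indexed from 0 to `n_way - 1` inside every episode, so the classifier cannot rely on a fixed class identity across tasks.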
Few-shot CIFAR100. We introduce a new image-based dataset for few-shot learning, built on CIFAR100 (Krizhevsky, 2009). We will refer to it as FC100. The main motivation for introducing this new dataset is to validate that the main results appearing in the experimental section generalize well beyond miniImagenet. The secondary motivation is that FC100 is suited for faster few-shot scenario prototyping than miniImagenet and it presents a more challenging few-shot learning problem because of the reduced image size. On top of that, we propose a class split in FC100 that minimizes the information overlap between splits, making it significantly more challenging than e.g. Omniglot. The original CIFAR100 dataset consists of 32x32 color images belonging to 100 different classes, 600 images per class. The 100 classes are further grouped into 20 superclasses. We split the dataset by superclass, rather than by individual class, to minimize the information overlap. Thus the train split contains 60 classes belonging to 12 superclasses, while the validation and test splits contain 20 classes belonging to 4 superclasses each. The exact class split is provided in Supplementary Materials, Section S2. The tasks are sampled uniformly at random within the train, validation and test subsets. Therefore, each task with high probability contains samples belonging to classes from several superclasses.
3.2 On the similarity metric
| Method | miniImagenet, 5-way train | miniImagenet, 20-way train | FC100, 5-way train | FC100, 20-way train |
|---|---|---|---|---|
| Proto Nets (Snell et al., 2017) | 65.8 ± 0.7 | 68.2 ± 0.7 | N/A | N/A |
| Proto Nets | 67.7 ± 0.2 | 68.9 ± 0.3 | 51.1 ± 0.2 | 50.3 ± 0.3 |
| Prototypical Cosine | 54.5 ± 1.1 | 53.9 ± 0.6 | 40.9 ± 0.6 | 37.1 ± 1.9 |
| Prototypical Cosine Scaled | 68.2 ± 0.8 | 68.1 ± 0.7 | 51.0 ± 0.6 | 49.6 ± 0.5 |

Table 2: Average classification accuracy in percent with 95% confidence interval on the 5-shot, 5-way classification task. The last three rows correspond to our implementation: first with the Euclidean distance, second with the cosine distance, and third with the scaled cosine distance.
We re-implemented prototypical networks (Snell et al., 2017) and use the Euclidean distance and the cosine similarity to test the effects of scaling (see Section 2.1). We closely follow the experimental setup defined by Snell et al. (2017) (same feature extractor and training procedure). The scaling parameter used in the last row was cross-validated on the validation set. Results are presented in Table 2.
As can be seen in row two of Table 2, our re-implementation of Proto Nets (Snell et al., 2017) obtained slightly better performance (68.9% and 67.7% in the 20-way and 5-way training scenarios, respectively) by increasing the number of training steps from 20K to 40K. (With 20K steps it was possible to recover the exact original performance reported in Snell et al. (2017), which is not included in Table 2 for the sake of brevity.)
Importantly, we confirm the hypothesis that the improvement attributed to the Euclidean distance in Snell et al. (2017) was due to a scaling effect. Namely, we show that the scaled cosine similarity matches very closely the performance of the Euclidean metric, with an improvement of about 14 percentage points on miniImagenet (similar results on FC100) over the non-scaled version. In order to control for the potential effect that the scaling parameter may have on the learning rate, as indicated by Equation (2), training was performed using multiple initial learning rates (covering the range between 0.0005 and 0.01), obtaining similar accuracy each time. Hereinafter, we report the results with the Euclidean metric for brevity, since the scaled cosine produces similar results. Moreover, since the prototypical approaches with the Euclidean distance and with the scaled cosine perform similarly and both are superior to Vinyals et al. (2016), we base our results on Snell et al. (2017).
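One way to see why an unscaled cosine similarity may interact poorly with the softmax is that its values are confined to [-1, 1], which caps the probability the softmax can assign to the correct class. The following Python sketch computes this cap for an N-way task when the logits are a scale factor times a cosine similarity. The function name and the chosen scale values are illustrative assumptions, and the calculation is a simple bound rather than a result from the paper.

```python
import math

def max_true_class_prob(n_way, alpha):
    """Largest softmax probability reachable with logits alpha * cos,
    attained when the true class has cosine +1 and all others -1."""
    best = math.exp(alpha)
    rest = (n_way - 1) * math.exp(-alpha)
    return best / (best + rest)

for alpha in (1.0, 5.0, 10.0):
    cap = max_true_class_prob(5, alpha)
    print(alpha, cap, -math.log(cap))  # scale, probability cap, loss floor
```

For a 5-way task without scaling the cap is roughly 0.65, so the cross-entropy loss cannot approach zero even for perfectly separated embeddings, whereas a moderate scale removes this ceiling.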
3.3 TEN importance across layers
We hypothesized in Section 2.2 that the TEN conditioning should not be equally important at all depths. Fig. 2 depicts the boxplot of the empirical observations of the learned TEN post-multipliers $\gamma_0$ and $\beta_0$ at different depths of the feature extractor (larger absolute values of $\gamma_0$ and $\beta_0$ imply a larger influence of their respective TEN layers). We can see that for the multiplier $\gamma_0$, the absolute value of its scale tends to increase as we approach the softmax layer. Interestingly, peaks can be observed every 3 layers (layers 3, 6, 9, 12). The peaks correspond to the location of the convolutional layers preceding the max-pool layers. For the bias parameter $\beta_0$, the only layer having a large absolute value of its scale is the last layer, before the softmax. We attribute the observed pattern to the fact that the shallower layers in the feature extractor tend to be less task-specific than the deeper layers. Following this intuition, we performed experiments in which we (i) kept the TEN injection solely in layers preceding the max-pool and (ii) kept the TEN injection only in the very last layer. Interestingly, we saw that TEN layers with small weight still provide some positive contribution, although most of the contribution is indeed provided by the layers preceding the max-pool operation.
3.4 Ablation study
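Table 3: Ablation study, average classification accuracy in percent. α: metric scaling; AT: auxiliary task co-training; TC: task conditioning.

| α | AT | TC | miniImagenet 1-shot | miniImagenet 5-shot | miniImagenet 10-shot | FC100 1-shot | FC100 5-shot | FC100 10-shot |
|---|---|---|---|---|---|---|---|---|
|  |  |  | 56.5 ± 0.4 | 74.2 ± 0.2 | 78.6 ± 0.4 | 37.8 ± 0.4 | 53.3 ± 0.5 | 58.7 ± 0.4 |
| ✓ |  |  | 56.8 ± 0.3 | 75.7 ± 0.2 | 79.6 ± 0.4 | 38.0 ± 0.3 | 54.0 ± 0.5 | 59.8 ± 0.3 |
| ✓ | ✓ |  | 58.0 ± 0.3 | 75.6 ± 0.4 | 80.0 ± 0.3 | 39.0 ± 0.4 | 54.7 ± 0.5 | 60.4 ± 0.4 |
| ✓ |  | ✓ | 54.4 ± 0.3 | 74.6 ± 0.3 | 78.7 ± 0.4 | 37.8 ± 0.2 | 54.0 ± 0.7 | 58.8 ± 0.3 |
| ✓ | ✓ | ✓ | 58.5 ± 0.3 | 76.7 ± 0.3 | 80.8 ± 0.3 | 40.1 ± 0.4 | 56.1 ± 0.4 | 61.6 ± 0.5 |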
In this section, we study the impact on generalization accuracy of the scaling, task conditioning, auxiliary co-training, and the feature extractor. Results are summarized in Table 3.
First, we validated the hypothesis that there is an optimal value of the metric scaling parameter ($\alpha$) for a given combination of dataset and metric, which is reflected in the inverse U-shape of the curves in Fig. 3.
Second, we studied the effects of the task conditioning described in Section 2.2. No improvement was observed for the task-conditioned ResNet-12 without auxiliary co-training (see Table 3). We observed that learning useful features for the TEN and the main feature extractor at the same time is hard and that the optimization gets stuck in local extrema. The problem is solved by co-training on the auxiliary task of predicting Imagenet labels using an additional fully connected layer with softmax, see Section 2.4. In effect, we observed that auxiliary co-training provides two benefits: (i) making the initial convergence easier, and (ii) providing regularization on the few-shot learning task by forcing the feature extractor to perform well on two decoupled tasks. The latter benefit can only be observed when the feature extraction unit is sufficiently decoupled between the main task and the auxiliary task via the use of the TEN (the feature extractor output is additionally adjusted on the target task using FiLM).
4 Conclusions and Future Work
We proposed, analyzed, and empirically validated several improvements in the domain of few-shot learning. We showed that the scaled cosine similarity performs on par with the Euclidean distance, unlike its unscaled counterpart. In fact, based on our results, we argue that the scaling factor is a necessary standard component of any few-shot learning algorithm relying on a similarity metric and the cross-entropy loss function. This is especially important in the context of finding new, more effective similarity measures for few-shot learning. Moreover, our theoretical analysis demonstrated that simply scaling the similarity metric results in completely different regimes of parameter updates when using softmax and categorical cross-entropy. We also identified that the optimal performance is achieved in between the two asymptotic regimes of the softmax. This poses the research question of explicitly designing loss functions and schedules that are optimal for few-shot learning. We further proposed task representation conditioning as a way to improve the performance of a feature extractor on the few-shot classification task. In this context, designing more powerful task representations, for example based on higher-order statistics of class embeddings, looks like a very promising avenue for future work. The experimental results obtained on two independent challenging datasets demonstrated that the proposed approach significantly improves over existing results and achieves state-of-the-art performance on the few-shot image classification task.
Appendix
Appendix A Proof of Lemma 1
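First, consider the case $\alpha \to 0$. Denoting $d_{ij} \equiv d(f_\phi(\mathbf{x}_i), \mathbf{c}_j)$ we have:

$$\lim_{\alpha \to 0} \frac{\sum_j \exp(-\alpha d_{ij}) \frac{\partial}{\partial \phi} d_{ij}}{\sum_j \exp(-\alpha d_{ij})} = \frac{1}{C} \sum_j \frac{\partial}{\partial \phi} d_{ij}.$$

Substituting this limit into Eq. (2) and separating the $j = k$ term yields (3).

Second, consider the case $\alpha \to \infty$:

$$\frac{\sum_j \exp(-\alpha d_{ij}) \frac{\partial}{\partial \phi} d_{ij}}{\sum_\ell \exp(-\alpha d_{i\ell})} = \sum_j \frac{\frac{\partial}{\partial \phi} d_{ij}}{\sum_\ell \exp\big(\alpha (d_{ij} - d_{i\ell})\big)}.$$

It is obvious that whenever at least one of the exponential terms in the denominator in the expression above has a positive rate, corresponding to the case $d_{ij} > d_{i\ell}$ for some $\ell$, the ratio converges to zero as $\alpha \to \infty$ under assumption $\mathcal{A}_2$. The only case when the limit is non-zero is when $\mathbf{c}_j$ is the prototype closest to the query point $\mathbf{x}_i$. If we define the index of this prototype as $j_i^* = \arg\min_j d_{ij}$, then the following holds: $d_{ij_i^*} - d_{i\ell} < 0$, $\forall \ell \neq j_i^*$, leading (under the additional assumption $\mathcal{A}_1$) to:

$$\lim_{\alpha \to \infty} \sum_j \frac{\frac{\partial}{\partial \phi} d_{ij}}{\sum_\ell \exp\big(\alpha (d_{ij} - d_{i\ell})\big)} = \frac{\partial}{\partial \phi} d_{ij_i^*}.$$

Therefore, (4) follows. ∎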
Acknowledgements
The authors acknowledge the support of the Spanish project TIN201565464R (MINECO/FEDER) and the 2016FI B 01163 grant of Generalitat de Catalunya. The authors would like to thank Nicolas Chapados, Adam Salvail and Rachel Samson, as well as the anonymous reviewers, for their careful reading of the manuscript and for providing constructive feedback and valuable suggestions.
References
 Bauer et al. (2017) M. Bauer, M. Rojas-Carulla, J. B. Świątkowski, B. Schölkopf, and R. E. Turner. Discriminative k-shot learning using probabilistic models. arXiv preprint arXiv:1706.00326, 2017.
 Carey and Bartlett (1978) S. Carey and E. Bartlett. Acquiring a single new word. 1978.
 Dumoulin et al. (2017) V. Dumoulin, J. Shlens, and M. Kudlur. A learned representation for artistic style. ICLR, 2017.
 Edwards and Storkey (2016) H. Edwards and A. Storkey. Towards a neural statistician. arXiv preprint arXiv:1606.02185, 2016.
 Fink (2005) M. Fink. Object classification from a single example utilizing class relevance metrics. In NIPS, pages 449–456, 2005.
 Finn et al. (2017) C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, pages 1126–1135, 2017.
 Hadsell et al. (2006) R. Hadsell, S. Chopra, and Y. LeCun. Dimensionality reduction by learning an invariant mapping. In CVPR, volume 2, pages 1735–1742. IEEE, 2006.
 He et al. (2016) K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. CVPR, pages 770–778, 2016.
 Hinton et al. (2015) G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. In NIPS Deep Learning and Representation Learning Workshop, 2015. URL http://arxiv.org/abs/1503.02531.

 Koch et al. (2015) G. Koch, R. Zemel, and R. Salakhutdinov. Siamese neural networks for one-shot image recognition. In ICML Deep Learning Workshop, volume 2, 2015.
 Krizhevsky (2009) A. Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
 Lacoste et al. (2017) A. Lacoste, T. Boquet, N. Rostamzadeh, B. Oreshkin, W. Chung, and D. Krueger. Deep prior. arXiv preprint arXiv:1712.05016, 2017.
 Lake et al. (2013) B. M. Lake, R. R. Salakhutdinov, and J. Tenenbaum. Oneshot learning by inverting a compositional causal process. In NIPS, pages 2526–2534, 2013.
 Lake et al. (2015) B. M. Lake, R. Salakhutdinov, and J. B. Tenenbaum. Humanlevel concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015.
 Li et al. (2006) F.F. Li, R. Fergus, and P. Perona. Oneshot learning of object categories. PAMI, 28(4):594–611, 2006.
 Mishra et al. (2018) N. Mishra, M. Rohaninejad, X. Chen, and P. Abbeel. A simple neural attentive metalearner. In ICLR, 2018.
 Munkhdalai et al. (2018) T. Munkhdalai, X. Yuan, S. Mehri, and A. Trischler. Rapid adaptation with conditionally shifted neurons. In ICML, 2018.
 Perez et al. (2017) E. Perez, H. de Vries, F. Strub, V. Dumoulin, and A. C. Courville. Learning visual reasoning without strong priors. CoRR, abs/1707.03017, 2017.
 Perez et al. (2018) E. Perez, F. Strub, H. De Vries, V. Dumoulin, and A. Courville. Film: Visual reasoning with a general conditioning layer. In AAAI, 2018.
 Plank and Alonso (2017) B. Plank and H. M. Alonso. When is multitask learning effective? Semantic sequence prediction under varying data conditions. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2017, Valencia, Spain, pages 44–53, 2017.

 Ramachandran et al. (2018) P. Ramachandran, B. Zoph, and Q. V. Le. Searching for activation functions. In ICLR, 2018.
 Ravi and Larochelle (2016) S. Ravi and H. Larochelle. Optimization as a model for fewshot learning. In ICLR, 2016.
 Ren et al. (2018) M. Ren, E. Triantafillou, S. Ravi, J. Snell, K. Swersky, J. B. Tenenbaum, H. Larochelle, and R. S. Zemel. Metalearning for semisupervised fewshot classification. arXiv preprint arXiv:1803.00676, 2018.

 Santoro et al. (2016) A. Santoro, S. Bartunov, M. Botvinick, D. Wierstra, and T. Lillicrap. Metalearning with memoryaugmented neural networks. In M. F. Balcan and K. Q. Weinberger, editors, ICML, volume 48 of Proceedings of Machine Learning Research, pages 1842–1850, New York, New York, USA, 20–22 Jun 2016. PMLR.
 Schmidhuber et al. (1997) J. Schmidhuber, J. Zhao, and M. Wiering. Shifting inductive bias with successstory algorithm, adaptive levin search, and incremental selfimprovement. Machine Learning, 28(1):105–130, 1997.

 Schroff et al. (2015) F. Schroff, D. Kalenichenko, and J. Philbin. Facenet: A unified embedding for face recognition and clustering. In CVPR, pages 815–823, 2015.
 Shyam et al. (2017) P. Shyam, S. Gupta, and A. Dukkipati. Attentive recurrent comparators. In ICML, pages 3173–3181, 2017.
 Snell et al. (2017) J. Snell, K. Swersky, and R. S. Zemel. Prototypical networks for fewshot learning. In NIPS, pages 4080–4090, 2017.
 Sung et al. (2018) F. Sung, Y. Yang, L. Zhang, T. Xiang, P. H. Torr, and T. M. Hospedales. Learning to compare: Relation network for fewshot learning. In CVPR, 2018.
 Taigman et al. (2015) Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. Webscale training for face identification. In CVPR, pages 2746–2754, 2015.
 Thrun (1998) S. Thrun. Lifelong learning algorithms. In Learning to learn, pages 181–209. Springer, 1998.
 Vaswani et al. (2017) A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In NIPS, pages 6000–6010, 2017.
 Vinyals et al. (2016) O. Vinyals, C. Blundell, T. Lillicrap, K. Kavukcuoglu, and D. Wierstra. Matching networks for one shot learning. In NIPS, pages 3630–3638, 2016.
 Wang et al. (2018) Y.X. Wang, R. Girshick, M. Hebert, and B. Hariharan. LowShot Learning from Imaginary Data. In CVPR, 2018.
Appendix S1 Architecture details
ResNet12 architecture details. The resnet blocks used in the ResNet12 feature extractor are shown in Fig. 1. The feature extractor consists of 4 resnet blocks, shown in Fig. 1(b), followed by a global averagepool. Each resnet block consists of 3 convolutional blocks, shown in Fig. 1(a), followed by a 2x2 maxpool. Each convolutional layer is followed by a batch norm layer and the swish1 activation function proposed by Ramachandran et al. [2018]. We found that the fully convolutional architecture performs best as a fewshot feature extractor, both on miniImagenet and on FC100. Inserting additional projection layers after the ResNet stack was always detrimental to the fewshot performance; we crossvalidated this result with multiple hyperparameter settings for the projection layers (number of layers, layer widths, and dropout). We also observed that adding extra convolutional layers and maxpool layers before the ResNet stack was detrimental to the fewshot performance. Therefore, we used a fully convolutional, fully residual architecture in all our experiments.
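The block structure described above can be tabulated with a few lines of code. This is a minimal sketch, not the authors' implementation: it takes the filter schedule from the hyperparameter paragraph that follows (64 filters in the first block, doubled after each maxpool) and additionally assumes an 84x84 input, convolutions that preserve spatial size, and a stride-2 maxpool that rounds down. The function name `resnet12_layout` and its parameters are hypothetical.

```python
def resnet12_layout(input_size=84, first_filters=64, num_blocks=4, convs_per_block=3):
    """Tabulate filters and spatial size per resnet block.

    Assumptions (not stated in the text): 84x84 input, size-preserving
    convolutions, and a 2x2 maxpool with stride 2 that rounds down.
    """
    layout = []
    size, filters = input_size, first_filters
    for block in range(1, num_blocks + 1):
        size = size // 2  # effect of the 2x2 maxpool closing the block
        layout.append({
            "block": block,
            "filters": filters,
            "conv_layers": convs_per_block,
            "out_size": size,
        })
        filters *= 2  # the filter count doubles after each maxpool
    return layout

for row in resnet12_layout():
    print(row)
```

Under these assumptions the stack contains 4 x 3 = 12 convolutional layers, and the global averagepool yields a feature vector whose length equals the filter count of the last block.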
The hyperparameters for the convolutional layers are as follows. The number of filters for the first ResNet block was set to 64 and it was doubled after each maxpool block. The regularizer weight was crossvalidated at 0.0005 for each layer.
TEN architecture details. The detailed architecture of the TEN block is depicted in Fig. 2. Our implementation of the TEN uses two separate fully connected residual networks to generate vectors . We crossvalidated the number of layers to be 3. The first layer projects the task representation into the target width. The target width is equal to the number of filters of the convolutional layer that the TEN block is conditioning (see Fig. 1(a)). The remaining layers operate at the target width and each of them has a skip connection. The regularizer weight for and was crossvalidated at 0.01 for each layer. We found that smaller values led to considerable overfitting. In addition, we were not able to successfully train TEN without and , because the training tended to get stuck in local minima where the overall effect of introducing TEN was detrimental to the fewshot performance of the architecture.
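The layer structure of one TEN branch can be sketched in plain Python. This is a hedged illustration under several assumptions that the text does not specify: the ReLU activation, the placement of the activation inside the residual branch, the weight initialization, and all function names (`make_ten_branch`, `ten_branch_forward`) are hypothetical. Only the layer count of 3, the initial projection to the target width, and the skip connections on the remaining layers come from the description above.

```python
import random

def dense(x, weights, bias):
    """Fully connected layer: weights is a list of rows, one per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def relu(v):
    # Assumed activation; the text does not name the one used inside TEN.
    return [max(0.0, a) for a in v]

def init_dense(n_in, n_out, rng):
    scale = (1.0 / n_in) ** 0.5  # assumed initialization scale
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def make_ten_branch(task_dim, target_width, num_layers=3, seed=0):
    """One of the two separate networks: a projection plus residual layers."""
    rng = random.Random(seed)
    dims = [task_dim] + [target_width] * num_layers
    return [init_dense(dims[i], dims[i + 1], rng) for i in range(num_layers)]

def ten_branch_forward(task_repr, layers):
    # First layer projects the task representation to the target width.
    h = dense(task_repr, *layers[0])
    # Remaining layers operate at the target width, each with a skip connection.
    for weights, bias in layers[1:]:
        h = [hi + ri for hi, ri in zip(h, relu(dense(h, weights, bias)))]
    return h

# Two separate branches conditioning a hypothetical 64-filter convolutional layer.
branch_a = make_ten_branch(task_dim=512, target_width=64, seed=0)
branch_b = make_ten_branch(task_dim=512, target_width=64, seed=1)
task_repr = [0.01] * 512
print(len(ten_branch_forward(task_repr, branch_a)), len(ten_branch_forward(task_repr, branch_b)))
```

Each branch outputs one vector whose length matches the number of filters of the conditioned layer, so it can be applied per feature map.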
Appendix S2 Fewshot CIFAR100 details
Train split. Superclass labels: {1, 2, 3, 4, 5, 6, 9, 10, 15, 17, 18, 19}; superclass names: {fish, flowers, food_containers, fruit_and_vegetables, household_electrical_devices, household_furniture, large_manmade_outdoor_things, large_natural_outdoor_scenes, reptiles, trees, vehicles_1, vehicles_2}.
Validation split. Superclass labels: {8, 11, 13, 16}; superclass names: {large_carnivores, large_omnivores_and_herbivores, noninsect_invertebrates, small_mammals}.
Test split. Superclass labels: {0, 7, 12, 14}; superclass names: {aquatic_mammals, insects, medium_mammals, people}.
We would like to stress that we still sample all the tasks uniformly at random within train, validation and test subsets. Therefore, each task with very high probability contains samples belonging to classes from several superclasses.
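The split definition above can be written down and sanity-checked directly. The sketch below is illustrative; the variable names are hypothetical, and the class counts rely on the fact that CIFAR100 groups its 100 classes into 20 superclasses of 5 classes each.

```python
# Superclass labels per split, copied from the lists above.
TRAIN_SUPERCLASSES = {1, 2, 3, 4, 5, 6, 9, 10, 15, 17, 18, 19}
VAL_SUPERCLASSES = {8, 11, 13, 16}
TEST_SUPERCLASSES = {0, 7, 12, 14}

CLASSES_PER_SUPERCLASS = 5  # CIFAR100: 20 superclasses x 5 classes

splits = {
    "train": TRAIN_SUPERCLASSES,
    "val": VAL_SUPERCLASSES,
    "test": TEST_SUPERCLASSES,
}

# The three splits must be pairwise disjoint and cover all 20 superclasses.
all_labels = set().union(*splits.values())
total = sum(len(s) for s in splits.values())
splits_are_disjoint = total == len(all_labels)
splits_cover_everything = all_labels == set(range(20))

num_classes = {name: len(s) * CLASSES_PER_SUPERCLASS for name, s in splits.items()}
print(splits_are_disjoint, splits_cover_everything, num_classes)
```

Because the splits are made at the superclass level, no superclass contributes classes to more than one of the train, validation and test subsets.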
Appendix S3 Training procedure details
Episode composition. The training procedure composes a fewshot training batch from several tasks, where a task is understood to be a fixed selection of 5 classes. We found empirically that for the 5shot scenario the best number of tasks per batch was 2, for 10shot it was 1 and for 1shot it was 5. The sample set in each training batch was created using the same number of shots as in the target deployment (test) scenario. The images in the training query set were sampled uniformly at random. We observed that the best results were obtained when the number of query images was approximately equal to the total number of sample images in the batch. Thus we used 32 query images per task for 5shot, 64 for 10shot and 12 for 1shot.
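The batch composition described above can be sketched as a small sampler. This is a minimal sketch under stated assumptions, not the authors' pipeline: it assumes that query images are drawn from the images of the selected classes that were not used in the sample set, and the names `sample_task`, `sample_batch` and `SETTINGS` are hypothetical. The per-scenario numbers are the ones reported in the text.

```python
import random
from collections import defaultdict

# shots -> (tasks per batch, query images per task), as reported in the text.
SETTINGS = {1: (5, 12), 5: (2, 32), 10: (1, 64)}

def sample_task(labels, n_shot, n_query, n_way=5, rng=random):
    """Sample one task: n_way classes, n_shot sample images each, n_query queries."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    classes = rng.sample(sorted(by_class), n_way)
    support, pool = [], []
    for c in classes:
        indices = by_class[c][:]
        rng.shuffle(indices)
        support.extend(indices[:n_shot])
        pool.extend(indices[n_shot:])  # assumption: queries exclude sample images
    query = rng.sample(pool, n_query)  # uniform over the remaining images
    return classes, support, query

def sample_batch(labels, n_shot, rng=random):
    """Compose a training batch from several tasks for the given shot count."""
    tasks_per_batch, n_query = SETTINGS[n_shot]
    return [sample_task(labels, n_shot, n_query, rng=rng) for _ in range(tasks_per_batch)]

# Hypothetical toy dataset: 10 classes with 40 images each.
toy_labels = [i // 40 for i in range(400)]
batch = sample_batch(toy_labels, n_shot=5, rng=random.Random(0))
print(len(batch), [len(support) for _, support, _ in batch])
```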
The auxiliary classification task is based on the usual 64way training (for miniImagenet). Cotraining uses a fixed batch of 64 image samples sampled uniformly at random from the training set. The learning rate annealing schedule for the auxiliary task is synchronized with that of the main fewshot task.
Optimization, scheduling and learning rate. When training with the auxiliary classification task we used a total of 30000 episodes for training on miniImagenet and 10000 episodes for training on FC100. The results obtained without auxiliary classification cotraining used twice as many episodes. To obtain all our results we used SGD with momentum 0.9 and an initial learning rate of 0.1. The learning rate was annealed by a factor of 10 halfway through the training and two more times every 2500 episodes. The reported numbers are calculated using early stopping based on tracking the classification error on the validation set.
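The annealing schedule can be expressed as a step function of the episode index. The sketch below is one reading of the description, not a confirmed implementation: it assumes the first drop happens at exactly half of the total episodes and that the two further drops follow 2500 and 5000 episodes later. The function name `learning_rate` and its parameters are hypothetical.

```python
def learning_rate(episode, total_episodes, base_lr=0.1, decay=10.0,
                  extra_drops=2, drop_every=2500):
    """Step schedule: divide by `decay` at the halfway point, then
    `extra_drops` more times, once every `drop_every` episodes (assumed reading)."""
    first_drop = total_episodes // 2
    boundaries = [first_drop + i * drop_every for i in range(extra_drops + 1)]
    drops = sum(1 for boundary in boundaries if episode >= boundary)
    return base_lr / (decay ** drops)

# miniImagenet setting with auxiliary cotraining: 30000 episodes in total.
for episode in (0, 15000, 17500, 20000):
    print(episode, learning_rate(episode, total_episodes=30000))
```

With 30000 episodes this reading places the three drops at episodes 15000, 17500 and 20000.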
Classification accuracy evaluation. The accuracy is evaluated using 10 random restarts of the optimization procedure and based on 500 randomly generated tasks each having 100 random query samples.