TADAM: Task dependent adaptive metric for improved few-shot learning

05/23/2018 ∙ by Boris N. Oreshkin, et al. ∙ Element AI Inc

Few-shot learning has become essential for producing models that generalize from few examples. In this work, we identify that metric scaling and metric task conditioning are important to improve the performance of few-shot algorithms. Our analysis reveals that simple metric scaling completely changes the nature of few-shot algorithm parameter updates. Metric scaling provides improvements of up to 14% in accuracy for certain metrics on the mini-Imagenet 5-way 5-shot classification task. We further propose a simple and effective way of conditioning a learner on the task sample set, resulting in learning a task-dependent metric space. Moreover, we propose and empirically test a practical end-to-end optimization procedure based on auxiliary task co-training to learn a task-dependent metric space. The resulting few-shot learning model based on the task-dependent scaled metric achieves state of the art on mini-Imagenet. We confirm these results on another few-shot dataset that we introduce in this paper based on CIFAR100.


1 Introduction

Humans can learn to identify new categories from few examples, even from a single one (Carey and Bartlett, 1978). Few-shot learning has recently attracted significant attention (Vinyals et al., 2016; Snell et al., 2017; Sung et al., 2018; Santoro et al., 2016; Munkhdalai et al., 2018; Mishra et al., 2018), as it aims to produce models that can generalize from small amounts of labeled data. In the few-shot setting, one aims to learn a model that extracts information from a set of support examples (sample set) to predict the labels of instances from a query set. Recently, this problem has been reframed into the meta-learning framework (Ravi and Larochelle, 2016), i.e. the model is trained so that, given a sample set or task, it produces a classifier for that specific task. Thus, the model is exposed to different tasks (or episodes) during the training phase, and it is evaluated on a non-overlapping set of new tasks (Vinyals et al., 2016).

Two recent approaches have attracted significant attention in the few-shot learning domain: Matching Networks (Vinyals et al., 2016), and Prototypical Networks (Snell et al., 2017). In both approaches, the sample set and the query set are embedded with a neural network, and nearest neighbor classification is used given a metric in the embedded space. Since then, the problem of learning the most suitable metric for few-shot learning has been of interest to the field (Vinyals et al., 2016; Snell et al., 2017; Sung et al., 2018; Munkhdalai et al., 2018; Mishra et al., 2018). Learning a metric space in the context of few-shot learning generally implies identifying a suitable similarity measure (e.g. cosine or Euclidean), a feature extractor mapping raw inputs onto similarity space (e.g. convolutional stack for images or LSTM stack for text), a cost function to drive the parameter updates, and a training scheme (often episodic). Although the individual components in this list have been explored, the relationships between them have not received considerable attention.

In the current work we aim to close this gap. We show that taking into account the interaction between the identified components leads to significant improvements in the few-shot generalization. In particular, we show that a non-trivial interaction between the similarity metric and the cost function can be exploited to improve the performance of a given similarity metric via scaling. Using this mechanism we close more than the 10% gap in performance between the cosine similarity and the Euclidean distance reported in Snell et al. (2017). Even more importantly, we extend the very notion of the metric space by making it task dependent via conditioning the feature extractor on the specific task. However, learning such a space is in general more challenging than learning a static one. Hence, we find a solution in exploiting the interaction between the conditioned feature extractor and the training procedure based on auxiliary co-training on a simpler task. Our proposed few-shot learning architecture based on task-dependent scaled metric achieves superior performance on two challenging few-shot image classification datasets. It shows up to 8.5% absolute accuracy improvement over the baseline (Snell et al., 2017), and 4.8% over the state of the art (Munkhdalai et al., 2018) on the 5-shot, 5-way mini-Imagenet classification task, reaching 76.7% accuracy, which is the best reported accuracy on this dataset.

1.1 Background

We consider the episodic $K$-shot, $C$-way classification scenario. In this scenario, a learning algorithm is provided with a sample set $\mathcal{S} = \{(x_i, y_i)\}_{i=1}^{KC}$ consisting of $K$ examples for each of $C$ classes and a query set $\mathcal{Q} = \{(x_i, y_i)\}_{i=1}^{q}$ for a task to be solved within a given episode. The sample set provides the task information via observations $x_i \in \mathbb{R}^{D_x}$ and their respective class labels $y_i \in \{1, \ldots, C\}$. Given the information in the sample set $\mathcal{S}$, the learning algorithm is able to classify individual samples from the query set $\mathcal{Q}$. Next, we define a similarity measure $d : \mathbb{R}^{2 D_z} \to \mathbb{R}$. Note that $d$ does not have to satisfy the classical metric properties (non-negativity, symmetry, subadditivity) to be useful in the context of few-shot learning. The dimensionality of metric input, $D_z$, will most naturally be related to the size of embedding created by a (deep) feature extractor $f_\phi : \mathbb{R}^{D_x} \to \mathbb{R}^{D_z}$, parameterized by $\phi$, mapping $x$ to $z$. Here $\phi \in \mathbb{R}^{D_\phi}$ is a list of parameters defining $f_\phi$, e.g. a list of weights in a neural network. The set of representations $(f_\phi(x_i), y_i)$, $\forall (x_i, y_i) \in \mathcal{S}, \mathcal{Q}$, can directly be used to solve the few-shot learning classification problem by association. For example, Matching networks (Vinyals et al., 2016) use a sample-wise attention mechanism to perform kernel label regression. Instead, Snell et al. (2017) defined a feature representation $c_k$ for each class $k$ as the mean over embeddings belonging to $\mathcal{S}_k = \{(x_i, y_i) \in \mathcal{S} : y_i = k\}$: $c_k = \frac{1}{K} \sum_{x_i \in \mathcal{S}_k} f_\phi(x_i)$. To learn $\phi$, they minimize $-\log p_\phi(y = k \mid x)$ using the softmax over prototypes $c_k$ to define the likelihood: $p_\phi(y = k \mid x) = \mathrm{softmax}(-d(f_\phi(x), c_k))$.
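The prototype-based classification described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the embeddings are assumed to have already been produced by the feature extractor, the squared Euclidean distance is one possible choice of similarity measure, and the function names and toy numbers are hypothetical.

```python
import math

def squared_euclidean(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def class_prototypes(embeddings, labels):
    """Mean embedding per class: one prototype for every label seen."""
    sums, counts = {}, {}
    for z, y in zip(embeddings, labels):
        acc = sums.setdefault(y, [0.0] * len(z))
        for i, value in enumerate(z):
            acc[i] += value
        counts[y] = counts.get(y, 0) + 1
    return {k: [s / counts[k] for s in total] for k, total in sums.items()}

def class_probabilities(query, prototypes, distance=squared_euclidean):
    """Softmax over negative distances from a query embedding to each prototype."""
    logits = {k: -distance(query, c) for k, c in prototypes.items()}
    top = max(logits.values())  # subtract the max for numerical stability
    exps = {k: math.exp(v - top) for k, v in logits.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}

# Toy 2-way, 2-shot episode with 2-D embeddings (made-up numbers).
support = [[0.0, 0.0], [0.0, 2.0], [4.0, 0.0], [4.0, 2.0]]
support_labels = [0, 0, 1, 1]
prototypes = class_prototypes(support, support_labels)
probs = class_probabilities([1.0, 1.0], prototypes)
```

The query at (1, 1) lies much closer to the class-0 prototype than to the class-1 prototype, so almost all of the probability mass goes to class 0.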

1.2 Summary of contributions

Metric Scaling: To our knowledge, this is the first study to (i) propose metric scaling to improve performance of few-shot algorithms, (ii) mathematically analyze its effects on objective function updates and (iii) empirically demonstrate its positive effects on few-shot performance.

Task Conditioning: We use a task encoding network to extract a task representation based on the task’s sample set. This is used to influence the behavior of the feature extractor through FiLM (Perez et al., 2018).

Auxiliary task co-training: We show that co-training the feature extraction on a conventional supervised classification task reduces training complexity and provides better generalization.

1.3 Related work

Three main approaches for solving the few-shot classification problem can be identified in the literature. The first one, which is used in this work, is the meta-learning approach, i.e. learning a model that, given a task (set of labeled data), produces a classifier that generalizes across all tasks (Thrun, 1998; Schmidhuber et al., 1997). This is the case of Matching Networks (Vinyals et al., 2016), which optionally use a Recurrent Neural Network (RNN) to accumulate information about a given task. In MAML (Finn et al., 2017), the parameters of an arbitrary learner model are optimized so that they can be quickly adapted to a particular task. In “Optimization as a model” (Ravi and Larochelle, 2016), a learner model is adapted to a new episodic task by a recurrent meta-learner producing efficient parameter updates. A more general approach was proposed by Santoro et al. (2016), where the meta-learner is trained to represent entries from a sample set in an external memory. Similarly, adaResNet (Munkhdalai et al., 2018) uses memory and the sample set to produce shift coefficients on the neuron activations of the query set classifier. Many recent approaches focus on learning a metric on the episodic feature space. Prototypical networks (Snell et al., 2017) use a feed-forward neural network to embed the task examples and perform nearest neighbor classification with the class centroids. The relation network approach by Sung et al. (2018) introduces a separate learnable similarity metric. SNAIL (Mishra et al., 2018) uses an explicit attention mechanism applicable both to supervised and to sequence-based reinforcement learning tasks. It has also been shown that these approaches benefit from leveraging unlabeled and simulated data (Ren et al., 2018; Wang et al., 2018).

A second approach aims to maximize the distance between examples from different classes (Koch et al., 2015). Similarly, in Hadsell et al. (2006), a contrastive loss function is used to learn to project data onto a manifold that is invariant to deformations in the input space. In the same vein, in Fink (2005), Schroff et al. (2015) and Taigman et al. (2015), triplet loss is used for learning a representation for few-shot learning. The attentive recurrent comparators (Shyam et al., 2017) go beyond classical siamese approaches and use a recurrent architecture to learn to perform pairwise comparisons and predict if the compared examples belong to the same class.

The third approach relies on Bayesian modeling of the prior distribution of the different categories like in Li et al. (2006); Bauer et al. (2017), or Lake et al. (2013); Edwards and Storkey (2016); Lacoste et al. (2017) who rely on hierarchical Bayesian modeling.

As for task conditioning, Dumoulin et al. (2017) and Perez et al. (2017, 2018) proposed conditional batch normalization for style transfer and visual reasoning. Differently, we modify the conditioning scheme to adapt it to few-shot learning, introducing $\gamma_0$ and $\beta_0$ priors, and auxiliary co-training. In the few-shot learning context, task conditioning ideas can be traced back to Vinyals et al. (2016), although in an implicit form as there is no notion of task embedding. In our work, we explicitly introduce a task representation (see Fig. 1) computed as the mean of the task class centroids (task prototypes). This is much simpler than the individual sample-level LSTM/attention models in Vinyals et al. (2016). Conditioning in Vinyals et al. (2016) is applied as a postprocessing of the output of a fixed feature extractor. We propose to condition the feature extractor by predicting its own batch normalization parameters, thus making the feature extractor behaviour task-dynamic without cumbersome fine-tuning on the support set. In order to train the task-conditioned architecture we use multitask training with a usual 64-way classification task. Even though auxiliary co-training is beneficial for learning in general, “little is known on when multitask learning works and whether there are data characteristics that help to determine its success” (Plank and Alonso, 2017). We show that combining task conditioning and auxiliary co-training is beneficial in the context of few-shot learning.

The scaling and temperature adjustment in the softmax was discussed by Hinton et al. (2015) in the context of model distillation. We propose to use it in the few-shot learning scenario and provide novel theoretical and empirical results quantifying the effects of the scaling parameter.

The rest of the paper is organized as follows. Section 2 describes our contributions in detail. Section 3 highlights the importance of each contribution via an ablation study. The study is performed over two different benchmarks in the regime of 1-shot, 5-shot and 10-shot learning to verify if conclusions hold across different setups. Finally, Section 4 concludes the paper and outlines future research directions.

2 Model Description

2.1 Metric Scaling

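Snell et al. (2017), using the approach described in detail in Section 1.1, found that the Euclidean distance outperformed the cosine distance used in Vinyals et al. (2016). We hypothesize that the improvement could be directly attributed to the interaction of the different scaling of the metrics with the softmax. Moreover, the dimensionality of the output is known to have a direct impact on the output scale even for the Euclidean distance (Vaswani et al., 2017). Hence, we propose to scale the distance metric by a learnable temperature, $\alpha$, $p_{\phi,\alpha}(y = k \mid x) = \mathrm{softmax}(-\alpha\, d(z, c_k))$, to enable the model to learn the best regime for each similarity metric, thus improving the performance of all metrics. To further understand the role of $\alpha$, we analyze the class-wise cross-entropy loss function, $J_k(\phi, \alpha)$ (note that the total loss is simply $\sum_k J_k(\phi, \alpha)$):

$$J_k(\phi, \alpha) = \sum_{x_i \in \mathcal{Q}_k} \Big[ \alpha\, d(f_\phi(x_i), c_k) + \log \sum_j \exp\big(-\alpha\, d(f_\phi(x_i), c_j)\big) \Big], \tag{1}$$

where $\mathcal{Q}_k = \{(x_i, y_i) \in \mathcal{Q} : y_i = k\}$ is the query set corresponding to the class $k$. Its gradient, which is used to update the parameters $\phi$, is given by the following expression:

$$\frac{\partial}{\partial \phi} J_k(\phi, \alpha) = \alpha \sum_{x_i \in \mathcal{Q}_k} \Bigg[ \frac{\partial}{\partial \phi} d(f_\phi(x_i), c_k) - \frac{\sum_j \exp\big(-\alpha\, d(f_\phi(x_i), c_j)\big)\, \frac{\partial}{\partial \phi} d(f_\phi(x_i), c_j)}{\sum_j \exp\big(-\alpha\, d(f_\phi(x_i), c_j)\big)} \Bigg]. \tag{2}$$

At first glance, the effect of $\alpha$ on the expression of the derivative is twofold: (i) an overall scaling, and (ii) regulating the sharpness of weighting in the second term inside the brackets on the RHS. Below we explore the behavior of the $\alpha$-normalized gradient (the effect of the $\alpha$-related gradient scaling is trivial) in the limits $\alpha \to 0$ and $\alpha \to \infty$.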

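Lemma 1 (Metric scaling).

If the following assumptions hold:

$$\mathcal{A}_1: \; d(f_\phi(x), c_k) \neq d(f_\phi(x), c_j), \quad \forall k \neq j, \; \forall x \in \mathcal{Q};$$
$$\mathcal{A}_2: \; \Big| \frac{\partial}{\partial \phi} d(f_\phi(x), c) \Big| < \infty, \quad \forall x, c, \phi;$$

then it is true that:

$$\lim_{\alpha \to 0} \frac{1}{\alpha} \frac{\partial}{\partial \phi} J_k(\phi, \alpha) = \sum_{x_i \in \mathcal{Q}_k} \Big[ \frac{C-1}{C} \frac{\partial}{\partial \phi} d(f_\phi(x_i), c_k) - \frac{1}{C} \sum_{j \neq k} \frac{\partial}{\partial \phi} d(f_\phi(x_i), c_j) \Big], \tag{3}$$

$$\lim_{\alpha \to \infty} \frac{1}{\alpha} \frac{\partial}{\partial \phi} J_k(\phi, \alpha) = \sum_{x_i \in \mathcal{Q}_k} \Big[ \frac{\partial}{\partial \phi} d(f_\phi(x_i), c_k) - \frac{\partial}{\partial \phi} d(f_\phi(x_i), c_{j_i^*}) \Big], \tag{4}$$

where $j_i^* = \arg\min_j d(f_\phi(x_i), c_j)$.

Proof.

Please refer to Appendix A. ∎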

From Eq. (3), it is clear that for small $\alpha$ values, the first term minimizes the embedding distance between query samples and their corresponding prototypes. The second term maximizes the embedding distance between the samples and the prototypes of the non-belonging categories. For large $\alpha$ values (Eq. (4)), the first term is the same as in Eq. (3), while the second term maximizes the distance of the sample to the closest wrongly assigned prototype (if any). If $j_i^* = k$ (no error), the derivative contribution of the point is zero. This is equivalent to learning only from the hardest examples resulting in association errors. Thus, the two different regimes of $\alpha$ favor either minimizing the overlap of the sample distributions or correcting cluster assignments sample-wise.

The large $\alpha$ regime is more directly related to resolving the few-shot classification errors. At the same time, the update strategy generated in this regime has a drawback. As the optimization proceeds and the classification accuracy increases, the number of incorrectly classified samples reduces on average, and this leads to the reduction in the average effective batch size (more samples generate zero derivatives). Therefore, our hypothesis is that there is an optimal value of the scaling parameter $\alpha$ for a given combination of dataset, metric and task. Section 3.4 empirically demonstrates that the optimal value of $\alpha$ indeed exists and that it can be, e.g., cross-validated on a validation set.
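The two regimes can be checked numerically on a single query. The sketch below assumes the likelihood is a softmax over $-\alpha d$, in which case the scale-normalized gradient weights the derivative of each distance by $\mathbb{1}[j = k] - w_j$, where $w_j$ are the softmax weights. The function names and the distance values are hypothetical and chosen only for illustration.

```python
import math

def softmax_weights(distances, alpha):
    """Weights exp(-alpha * d_j) / sum_l exp(-alpha * d_l), computed stably."""
    logits = [-alpha * d for d in distances]
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def gradient_coefficients(distances, true_class, alpha):
    """Coefficient of each distance derivative in the scale-normalized gradient.

    The true class gets 1 - w_k and every other class gets -w_j.
    """
    weights = softmax_weights(distances, alpha)
    return [(1.0 if j == true_class else 0.0) - w for j, w in enumerate(weights)]

# Made-up distances from one query to C = 5 prototypes. The true class is 0,
# but prototype 2 is closer, so the query is currently misclassified.
distances = [1.0, 3.0, 0.5, 2.0, 4.0]
small = gradient_coefficients(distances, true_class=0, alpha=1e-6)
large = gradient_coefficients(distances, true_class=0, alpha=1e3)

# A correctly classified query (true class 0 is also the closest prototype).
correct = gradient_coefficients([0.5, 3.0, 1.0], true_class=0, alpha=1e3)
```

For the tiny scale the coefficients approach (C-1)/C for the true class and -1/C for all others, so every prototype contributes. For the large scale only the true prototype and the closest wrong prototype receive non-zero coefficients, and the correctly classified query contributes nothing.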

2.2 Task conditioning

Figure 1: Proposed few-shot architecture. Blocks with shared parameters have dashed border.

Up until now we assumed the feature extractor $f_\phi(\cdot)$ to be task-independent. A dynamic task-conditioned feature extractor should be better suited for finding correct associations between given sample set class representations and query samples; this is implicitly done by Vinyals et al. (2016) with a bidirectional LSTM as a postprocessing of a fixed feature extractor. Differently, we explicitly define a dynamic feature extractor $f_\phi(x, \Gamma)$, where $\Gamma$ is the set of parameters predicted from a task representation such that the performance of $f_\phi(x, \Gamma)$ is optimized given the task sample set $\mathcal{S}$. This is related to the FiLM conditioning layer (Perez et al., 2018) and conditional batch normalization (Dumoulin et al., 2017; Perez et al., 2017) of the form $h_{\ell+1} = \gamma \odot h_\ell + \beta$, where $\gamma$ and $\beta$ are scaling and shift vectors applied to the layer $h_\ell$. Concretely, we propose to use the mean of the class prototypes as the task representation, $\bar{c} = \frac{1}{C} \sum_k c_k$, encode it with a task embedding network (TEN), and predict layer-level element-wise scale and shift vectors $\gamma$, $\beta$ for each convolutional layer in the feature extractor (see Figures 1 and 2 in the Supplementary Materials, Section S1). The task representation defined as the mean of task class centroids (i) reduces the dimensionality of the TEN input and (ii) replaces expensive RNN/CNN/attention modeling. At the same time, it is an effective way to cluster tasks: tasks having a larger number of similar classes in common will tend to cluster closer in the task representation space.

Our implementation of the TEN (see Supplementary Materials, Section S1 for more details) uses two separate fully connected residual networks to generate the vectors $\gamma$ and $\beta$. Following the terminology in Perez et al. (2017), the $\gamma$ parameter is learned in the delta regime, i.e. predicting deviation from unity. The most critical component in being able to successfully train the TEN was the addition of the scalar $L_2$-penalized post-multipliers $\gamma_0$ and $\beta_0$. They limit the effect of $\gamma$ (and $\beta$) by encoding a prior belief that all components of $\gamma$ (and $\beta$) should be simultaneously close to zero for a given layer unless task conditioning provides a significant information gain for this layer. Mathematically, this can be expressed as $\beta = \beta_0\, g_\theta(\bar{c})$ and $\gamma = \gamma_0\, h_\varphi(\bar{c}) + 1$, where $g_\theta$ and $h_\varphi$ are predictors of $\beta$ and $\gamma$.
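The conditioning step can be illustrated with a small plain-Python sketch. It is a simplification under stated assumptions: the two predictors are single affine layers standing in for the fully connected residual networks, the weights and prototypes are made-up numbers, the penalty on the post-multipliers is not shown, and all function names are hypothetical.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def linear(weights, bias, x):
    """Stand-in predictor: one affine layer instead of a residual network."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def film_parameters(task_repr, gamma_net, beta_net, gamma_0, beta_0):
    """Scale in the delta regime (deviation from one) and shift, both gated by scalars."""
    gamma = [gamma_0 * v + 1.0 for v in linear(*gamma_net, task_repr)]
    beta = [beta_0 * v for v in linear(*beta_net, task_repr)]
    return gamma, beta

def film(activations, gamma, beta):
    """Channel-wise scale and shift of a layer's activations."""
    return [g * h + b for g, h, b in zip(gamma, activations, beta)]

# Hypothetical 2-D prototypes for a 3-way task and a layer with 2 channels.
prototypes = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
task_repr = mean_vector(prototypes)  # mean of the class prototypes
gamma_net = ([[0.5, 0.0], [0.0, 0.5]], [0.0, 0.0])  # made-up (weights, bias)
beta_net = ([[1.0, 1.0], [1.0, -1.0]], [0.0, 0.0])

# With zero post-multipliers the layer output is left untouched.
gamma, beta = film_parameters(task_repr, gamma_net, beta_net, gamma_0=0.0, beta_0=0.0)
unchanged = film([3.0, -2.0], gamma, beta)

# Non-zero post-multipliers let the task representation modulate the layer.
gamma, beta = film_parameters(task_repr, gamma_net, beta_net, gamma_0=0.1, beta_0=0.1)
modulated = film([3.0, -2.0], gamma, beta)
```

The first call shows why penalizing the post-multipliers encodes the stated prior: when both are zero, the conditioned layer reduces to the unconditioned one regardless of what the predictors output.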

2.3 Architecture

The overall proposed few-shot classification architecture is depicted in Fig. 1 (see Supplementary Materials, Section S1 for more details). We employ ResNet-12 (He et al., 2016) as the backbone feature extractor. It has 4 blocks of depth 3 with 3x3 kernels and shortcut connections. A 2x2 max-pool is applied at the end of each block. Convolutional layer depth starts with 64 filters and is doubled after every max-pool. Note that this architecture is similar in spirit to architectures used in Bauer et al. (2017) and Munkhdalai et al. (2018), but we do not use any projection layers before or after the main backbone ResNet. On the first pass over the sample set, the TEN predicts the values of the $\gamma$ and $\beta$ parameters for each convolutional layer in the feature extractor from the task representation. Next, the sample set and the query set are processed by the feature extractor conditioned with the values of $\gamma$ and $\beta$ just generated. Both outputs are fed into a similarity metric to find an association between class prototypes and query instances. The output of the similarity metric is scaled by the scalar $\alpha$ and is fed into a softmax layer.
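The backbone dimensions implied by this description can be tabulated with a short sketch. The 84x84 input resolution is an assumption of this example (it is not stated in this section), as are size-preserving convolutions and an unpadded stride-2 max-pool; the function name is hypothetical.

```python
def resnet12_block_shapes(input_size=84, num_blocks=4, first_filters=64):
    """List (filters, output spatial size) for each residual block.

    Assumes size-preserving 3x3 convolutions and an unpadded 2x2 max-pool
    with stride 2 at the end of each block.
    """
    shapes = []
    size, filters = input_size, first_filters
    for _ in range(num_blocks):
        size //= 2                      # 2x2 max-pool halves the resolution
        shapes.append((filters, size))
        filters *= 2                    # width doubles after every max-pool
    return shapes

shapes = resnet12_block_shapes()
conv_layers = 4 * 3  # 4 blocks of depth 3 give the 12 convolutional layers
```

Under these assumptions the four blocks have 64, 128, 256 and 512 filters, and each of the 12 convolutional layers is a candidate site for a predicted scale and shift.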

| Model | 1-shot | 5-shot | 10-shot |
|---|---|---|---|
| Meta Nets (Ravi and Larochelle, 2016) | 43.4 | 60.6 | - |
| Matching Networks (Vinyals et al., 2016) | 46.6 | 60.0 | - |
| MAML (Finn et al., 2017) | 48.7 | 63.1 | - |
| Proto Nets (Snell et al., 2017) | 49.4 | 68.2 | |
| Relation Net (Sung et al., 2018) | 50.4 | 65.3 | - |
| SNAIL (Mishra et al., 2018) | 55.7 | 68.9 | - |
| Discriminative k-shot (Bauer et al., 2017) | 56.3 | 73.9 | 78.5 |
| adaResNet (Munkhdalai et al., 2018) | 56.9 | 71.9 | - |
| Ours | 58.5 | 76.7 | 80.8 |

Table 1: mini-Imagenet (Vinyals et al., 2016) 5-way classification accuracy (%). Our re-implementation.

2.4 Auxiliary task co-training

The TEN (Section 2.2) introduces additional complexity into the architecture via task conditioning layers inserted after the convolutional and batch norm blocks. We empirically observed that simultaneously optimizing convolutional filters and the TEN is overly challenging. We solved the problem by auxiliary co-training with an additional logit head (the normal 64-way classification in the mini-Imagenet case). The auxiliary task is sampled with a probability that is annealed over episodes. We annealed it using an exponential decay schedule of the form $0.9^{\lfloor 20 t / T \rfloor}$, where $T$ is the total number of training episodes and $t$ is the episode index. The initial auxiliary task selection probability was cross-validated to be 0.9 and the number of decay steps was chosen to be 20. We observed significant positive effects from the auxiliary task co-training (please refer to Section 3.4). The same positive effects were not observed with simple pre-training of the feature extractor. We attribute this to the regularization effects achieved via back-propagating auxiliary task gradients together with those of the main task.
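The annealed task selection can be sketched as follows. This is an illustration under assumptions: the staircase form `base ** floor(decay_steps * t / T)` and the default `base=0.9` are this sketch's reading of the schedule, the function names are hypothetical, and `decay_steps=20` follows the text.

```python
import random

def auxiliary_probability(episode, total_episodes, base=0.9, decay_steps=20):
    """Staircase exponential decay: base ** floor(decay_steps * t / T)."""
    return base ** (decay_steps * episode // total_episodes)

def pick_task(episode, total_episodes, rng):
    """Return "auxiliary" with the annealed probability, otherwise "few_shot"."""
    p = auxiliary_probability(episode, total_episodes)
    return "auxiliary" if rng.random() < p else "few_shot"

rng = random.Random(0)  # fixed seed so the run is repeatable
total = 1000
schedule = [auxiliary_probability(t, total) for t in (0, 50, 500, 999)]
choices = [pick_task(t, total, rng) for t in range(total)]
```

Early episodes are dominated by the auxiliary classification task, and the few-shot task gradually takes over as the probability decays in 20 steps.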

It is of interest to note that the few-shot co-training with an auxiliary classification task is related to curriculum learning (Santoro et al., 2016). The auxiliary classification problem could be considered a part of a simpler curriculum that helps the learner acquire the minimal skill level necessary before tackling harder few-shot classification tasks. Being effective at feature extraction (i.e. at task representation) forms a “prerequisite” for being effective at re-conditioning features based on the representation of a given task.

3 Experimental Results

Table 1 presents our key result in the context of the existing state of the art. The first five rows show approaches that use the same feature extractor as Vinyals et al. (2016), i.e. four stacked convolutional layers of 64 filters (32 in Ravi and Larochelle (2016) and Finn et al. (2017) to avoid overfitting). In the following rows we include models like the one we propose, which is based on ResNet (He et al., 2016). Concretely, SNAIL (Mishra et al., 2018), adaResNet (Munkhdalai et al., 2018), and our architecture use four residual blocks of three stacked convolutional layers, each block followed by max pooling. Differently, the feature extractor proposed in Bauer et al. (2017) is based on a ResNet-34 architecture with a reduced number of features.

As can be seen, the proposed algorithm significantly improves over the existing state-of-the-art results on the mini-Imagenet dataset. In the rest of the section we address the following research questions: (i) can metric scaling improve few-shot classification results? (Sections 3.2 and 3.4), (ii) what are the contributions of each component of our proposed architecture? (Section 3.4), (iii) can task conditioning improve few-shot classification results and how important is it at different feature extractor depths? (Sections 3.3 and 3.4), and (iv) can auxiliary classification task co-training improve accuracy on the few-shot classification task? (Section 3.4).

3.1 Experimental setup and datasets

The details of the experimental and training setup are provided in Supplementary Materials, Section S3. Note that we focused on mini-Imagenet (Vinyals et al., 2016) and Fewshot-CIFAR100 (introduced below) instead of Omniglot (Lake et al., 2015; Vinyals et al., 2016; Snell et al., 2017), as the former are more challenging and their error rates are more sensitive to model improvements.

mini-Imagenet. The mini-Imagenet dataset was proposed by Vinyals et al. (2016). It has 100 classes, with 600 images per class. Each task is generated by sampling 5 classes uniformly and 5 training samples per class; the remaining images from the 5 classes are used as query images to compute accuracy. To perform meta-validation and meta-test on unseen tasks (and classes), we isolate 16 and 20 classes, respectively, from the original set of 100, leaving 64 classes for the training tasks. We use exactly the same train/validation/test split as the one suggested by Ravi and Larochelle (2016).
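The task generation just described can be sketched as an episode sampler. This is a minimal illustration with stand-in data: the class names and image identifiers are placeholders rather than the actual dataset files, and `sample_episode` is a hypothetical helper name.

```python
import random

def sample_episode(images_by_class, n_way=5, k_shot=5, rng=random):
    """Sample one task: k_shot support images per class, the rest become queries."""
    classes = rng.sample(sorted(images_by_class), n_way)
    support, query = [], []
    for label, cls in enumerate(classes):
        images = list(images_by_class[cls])
        rng.shuffle(images)
        support += [(img, label) for img in images[:k_shot]]
        query += [(img, label) for img in images[k_shot:]]
    return support, query

# Stand-in data: 64 training classes with 600 image ids each (ids, not pixels).
train_split = {
    f"class_{c}": [f"class_{c}/img_{i}" for i in range(600)] for c in range(64)
}
support, query = sample_episode(train_split, rng=random.Random(0))
```

Class labels are re-indexed from 0 to 4 inside each episode, so the classifier only ever sees task-local labels and cannot memorize global class identities.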

Fewshot-CIFAR100. We introduce a new image-based dataset derived from CIFAR100 (Krizhevsky, 2009) for few-shot learning. We will refer to it as FC100. The main motivation for introducing this new dataset is to validate that the main results appearing in the experimental section generalize well beyond mini-Imagenet. The secondary motivation is that FC100 is suited for faster few-shot scenario prototyping than mini-Imagenet and it presents a more challenging few-shot learning problem because of the reduced image size. On top of that, we propose a class split in FC100 that minimizes the information overlap between splits, to make it significantly more challenging than e.g. Omniglot. The original CIFAR100 dataset consists of $32 \times 32$ color images belonging to 100 different classes, 600 images per class. The 100 classes are further grouped into 20 superclasses. We split the dataset by superclass, rather than by individual class, to minimize the information overlap. Thus the train split contains 60 classes belonging to 12 superclasses, and the validation and test splits contain 20 classes belonging to 4 superclasses each. The exact class split is provided in Supplementary Materials, Section S2. The tasks are sampled uniformly at random within the train, validation and test subsets. Therefore, each task with high probability contains samples belonging to classes from several superclasses.

3.2 On the similarity metric

| Model | mini-Imagenet, 5-way train | mini-Imagenet, 20-way train | FC100, 5-way train | FC100, 20-way train |
|---|---|---|---|---|
| Proto Nets (Snell et al., 2017) | 65.8 ± 0.7 | 68.2 ± 0.7 | N/A | N/A |
| Proto Nets | 67.7 ± 0.2 | 68.9 ± 0.3 | 51.1 ± 0.2 | 50.3 ± 0.3 |
| Prototypical Cosine | 54.5 ± 1.1 | 53.9 ± 0.6 | 40.9 ± 0.6 | 37.1 ± 1.9 |
| Prototypical Cosine Scaled | 68.2 ± 0.8 | 68.1 ± 0.7 | 51.0 ± 0.6 | 49.6 ± 0.5 |

Table 2: Average classification accuracy in percent with 95% confidence interval. 5-shot, 5-way classification task. The last three rows correspond to our implementation, first with the Euclidean distance, second with the cosine distance, and third with the scaled cosine distance.

We re-implemented prototypical networks (Snell et al., 2017), and used the Euclidean distance and the cosine similarity to test the effects of scaling (see Section 2). We closely follow the experimental setup defined by Snell et al. (2017) (same feature extractor and training procedure). The scaling parameter used on the last row was cross-validated on the validation set. Results are presented in Table 2.
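Entries of the form "mean ± interval" in Table 2 can be produced from per-episode accuracies as sketched below. The exact procedure used for the reported intervals is not given in this section, so the normal-approximation interval here is an assumption, and the per-episode accuracies are synthetic numbers generated only to exercise the function.

```python
import math
import random

def mean_and_ci95(values):
    """Sample mean and half-width of a normal-approximation 95% confidence interval."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)  # unbiased estimate
    half_width = 1.96 * math.sqrt(variance / n)
    return mean, half_width

# Synthetic per-episode accuracies in percent (not results from the paper).
rng = random.Random(0)
episode_accuracies = [rng.gauss(68.0, 8.0) for _ in range(500)]
mean, half_width = mean_and_ci95(episode_accuracies)
report = f"{mean:.1f} ± {half_width:.1f}"
```

The half-width shrinks with the square root of the number of evaluated episodes, which is why tight intervals require many test tasks.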

As can be seen in row two of Table 2, our re-implementation of Proto Nets (Snell et al., 2017) obtained slightly better performance (68.9% and 67.7% in the 20-way and 5-way training scenarios, respectively) by increasing the number of training steps from 20K to 40K. (With 20K steps it was possible to recover the exact original performance reported in Snell et al. (2017), which is not included in Table 2 for the sake of brevity.)

Importantly, we confirm the hypothesis that the improvement attributed to the Euclidean distance in Snell et al. (2017) was due to a scaling effect. Namely, we show that the scaled cosine similarity matches very closely the performance of the Euclidean metric, with an improvement of 14 percentage points on mini-Imagenet (similar results on FC100) over the non-scaled version. In order to control for the potential effect that the scaling parameter may have on the learning rate, as indicated by Equation (2), training was performed using multiple initial learning rates (covering the range between 0.0005 and 0.01), obtaining similar accuracy each time. Hereinafter, we report the results with the Euclidean metric for brevity, since the cosine produces similar results. Moreover, since the prototypical approach with the Euclidean distance and the one with the scaled cosine perform closely and both are superior to Vinyals et al. (2016), we base our results on Snell et al. (2017).
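One way to see why scaling could matter so much for the cosine similarity is to compute the largest probability a softmax can assign when its inputs are confined to the interval [-1, 1]. The sketch below is an illustrative calculation added here, not an argument made in the text above; the function name and the larger scale value of 10 are arbitrary choices.

```python
import math

def max_softmax_probability(num_classes, alpha=1.0):
    """Largest probability a softmax over alpha * cosine can give to one class.

    Cosine similarity lies in [-1, 1], so the best case is +1 for the
    correct class and -1 for every other class.
    """
    best = math.exp(alpha)
    rest = (num_classes - 1) * math.exp(-alpha)
    return best / (best + rest)

unscaled = max_softmax_probability(num_classes=5)             # alpha = 1
scaled = max_softmax_probability(num_classes=5, alpha=10.0)   # arbitrary larger scale
```

In the 5-way case the unscaled softmax cannot exceed a probability of roughly 0.65 even for a perfectly aligned query, so the cross-entropy loss stays bounded away from zero, whereas a larger scale removes this ceiling.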

3.3 TEN importance across layers

Figure 2: Distribution of the absolute values of the TEN scaling and bias parameters $\gamma_0$ and $\beta_0$ across layers of the ResNet feature extractor. Panels: (a) results on mini-Imagenet; (b) results on FC100. X-axes depict the layer number in both subplots. Higher convolutional layers are located closer to the final softmax layer.

We hypothesized in Section 2.2 that the TEN conditioning should not be equally important at all depths. Fig. 2 depicts the boxplot of the empirical observations of the learned TEN post-multipliers $\gamma_0$ and $\beta_0$ at different depths of the feature extractor (larger absolute values of $\gamma_0$ and $\beta_0$ imply a larger influence of their respective TEN layers). We can see that for the multiplier $\gamma_0$, the absolute value of its scale tends to increase as we approach the softmax layer. Interestingly, peaks can be observed every 3 layers (layers 3, 6, 9, 12). The peaks correspond to the location of the convolutional layers preceding the max-pool layers. For the bias parameter $\beta_0$, the only layer having a large absolute value of its scale is the last layer, before the softmax. We attribute the observed pattern to the fact that the shallower layers in the feature extractor tend to be less task-specific than the deeper layers. Following this intuition, we performed experiments in which we (i) kept the TEN injection solely in layers preceding the max pool and (ii) kept the TEN injection only in the very last layer. Interestingly, we saw that TEN layers with small weight still provide some positive contribution, although most of the contribution is indeed provided by the layers preceding the max pool operation.

3.4 Ablation study

| $\alpha$ | AT | TC | mini-Imagenet 1-shot | mini-Imagenet 5-shot | mini-Imagenet 10-shot | FC100 1-shot | FC100 5-shot | FC100 10-shot |
|---|---|---|---|---|---|---|---|---|
| | | | 56.5 ± 0.4 | 74.2 ± 0.2 | 78.6 ± 0.4 | 37.8 ± 0.4 | 53.3 ± 0.5 | 58.7 ± 0.4 |
| ✓ | | | 56.8 ± 0.3 | 75.7 ± 0.2 | 79.6 ± 0.4 | 38.0 ± 0.3 | 54.0 ± 0.5 | 59.8 ± 0.3 |
| ✓ | ✓ | | 58.0 ± 0.3 | 75.6 ± 0.4 | 80.0 ± 0.3 | 39.0 ± 0.4 | 54.7 ± 0.5 | 60.4 ± 0.4 |
| ✓ | | ✓ | 54.4 ± 0.3 | 74.6 ± 0.3 | 78.7 ± 0.4 | 37.8 ± 0.2 | 54.0 ± 0.7 | 58.8 ± 0.3 |
| ✓ | ✓ | ✓ | 58.5 ± 0.3 | 76.7 ± 0.3 | 80.8 ± 0.3 | 40.1 ± 0.4 | 56.1 ± 0.4 | 61.6 ± 0.5 |

Table 3: Average classification accuracy (%) with 95% confidence interval on the 5-way classification task, and training with the Euclidean distance. The scale parameter $\alpha$ is cross-validated on the validation set. AT: auxiliary co-training. TC: task conditioning with TEN.
Figure 3: Metric scale parameter cross-validation results. Panels: (a) scaled Euclidean, mini-Imagenet; (b) scaled Euclidean, FC100; (c) scaled Euclidean with TEN, mini-Imagenet; (d) scaled Euclidean with TEN, FC100.

In this section, we study the impact on generalization accuracy of the scaling, task conditioning, auxiliary co-training, and the feature extractor. Results are summarized in Table 3.

First, we validated the hypothesis that there is an optimal value of the metric scaling parameter ($\alpha$) for a given combination of dataset and metric, which is reflected in the inverse U-shape of the curves in Fig. 3.

Second, we studied the effects of the task conditioning described in Section 2.2. No improvement was observed for the task-conditioned ResNet-12 without auxiliary co-training (see Table 3). We observed that learning useful features for the TEN and the main feature extractor at the same time is hard and that the optimization gets stuck in local extrema. The problem is solved by co-training on the auxiliary task of predicting Imagenet labels using an additional fully-connected layer with softmax, see Section 2.4. In effect, we observed that auxiliary co-training provides two benefits: (i) making the initial convergence easier, and (ii) providing regularization on the few-shot learning task by forcing the feature extractor to perform well on two decoupled tasks. The latter benefit can only be observed when the feature extraction unit is sufficiently decoupled on the main task and the auxiliary task via the use of TEN (the feature extractor output is additionally adjusted on the target task using FiLM).

As can be seen in the last row of Tables 1 and 3, our model trained with TEN and auxiliary co-training outperforms all the baselines and achieves state-of-the-art results.

4 Conclusions and Future Work

We proposed, analyzed, and empirically validated several improvements in the domain of few-shot learning. We showed that the scaled cosine similarity performs on par with the Euclidean distance, unlike its unscaled counterpart. In fact, based on our results, we argue that the scaling factor is a necessary standard component of any few-shot learning algorithm relying on a similarity metric and the cross-entropy loss function. This is especially important in the context of finding new, more effective similarity measures for few-shot learning. Moreover, our theoretical analysis demonstrated that simply scaling the similarity metric results in completely different regimes of parameter updates when using softmax and categorical cross-entropy. We also identified that the optimal performance is achieved in between the two asymptotic regimes of the softmax. This poses the research question of explicitly designing loss functions and schedules that are optimal for few-shot learning. We further proposed task representation conditioning as a way to improve the performance of a feature extractor on the few-shot classification task. In this context, designing more powerful task representations, for example, based on higher order statistics of class embeddings, looks like a very promising avenue for future work. The experimental results obtained on two independent challenging datasets demonstrated that the proposed approach significantly improves over existing results and achieves state of the art on the few-shot image classification task.

Appendix

Appendix A Proof of Lemma 1

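First, consider the case $\alpha \to 0$. Denoting $z_i = f_\phi(x_i)$, we have:

$$\lim_{\alpha \to 0} \frac{1}{\alpha} \frac{\partial}{\partial \phi} J_k(\phi, \alpha) = \lim_{\alpha \to 0} \sum_{x_i \in Q_k} \left[ \frac{\partial}{\partial \phi} d(z_i, c_k) - \frac{\sum_j \exp(-\alpha d(z_i, c_j)) \frac{\partial}{\partial \phi} d(z_i, c_j)}{\sum_j \exp(-\alpha d(z_i, c_j))} \right] = \sum_{x_i \in Q_k} \left[ \frac{K-1}{K} \frac{\partial}{\partial \phi} d(z_i, c_k) - \frac{1}{K} \sum_{j \neq k} \frac{\partial}{\partial \phi} d(z_i, c_j) \right].$$

Second, consider the case $\alpha \to \infty$:

$$\lim_{\alpha \to \infty} \frac{1}{\alpha} \frac{\partial}{\partial \phi} J_k(\phi, \alpha) = \sum_{x_i \in Q_k} \left[ \frac{\partial}{\partial \phi} d(z_i, c_k) - \lim_{\alpha \to \infty} \sum_j \frac{\frac{\partial}{\partial \phi} d(z_i, c_j)}{\sum_l \exp(-\alpha [d(z_i, c_l) - d(z_i, c_j)])} \right].$$

Whenever at least one of the exponential terms in the denominator in the expression above has a positive rate, corresponding to the case $d(z_i, c_j) > d(z_i, c_l)$ for some $l$, the ratio converges to zero as $\alpha \to \infty$ under the assumption that the gradients of the distance are bounded. The only case when the limit is non-zero is when $c_j$ is the prototype closest to the query point $x_i$. If we define the index of this prototype as $j^*_i = \arg\min_j d(z_i, c_j)$, then the following holds: $d(z_i, c_{j^*_i}) - d(z_i, c_l) < 0$ for all $l \neq j^*_i$, leading (under the additional assumption that no two prototypes are equidistant from the query point) to:

$$\lim_{\alpha \to \infty} \frac{1}{\alpha} \frac{\partial}{\partial \phi} J_k(\phi, \alpha) = \sum_{x_i \in Q_k} \left[ \frac{\partial}{\partial \phi} d(z_i, c_k) - \frac{\partial}{\partial \phi} d(z_i, c_{j^*_i}) \right].$$

Therefore, (4) follows. ∎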
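
The two limiting cases of the metric scaling factor can be checked numerically. The following is a minimal sketch, not taken from the paper: it evaluates the softmax weights over the negated, scaled prototype distances for a very small and a very large scaling factor (called `alpha` in the code). The distance values and the function name `softmax_weights` are illustrative assumptions.

```python
import math

def softmax_weights(distances, alpha):
    """Softmax over -alpha * distance, computed in a numerically stable way."""
    logits = [-alpha * d for d in distances]
    shift = max(logits)
    exps = [math.exp(l - shift) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical distances from one query point to K = 3 prototypes.
distances = [0.5, 1.0, 2.0]

small = softmax_weights(distances, alpha=1e-6)  # close to uniform, about 1/K each
large = softmax_weights(distances, alpha=1e3)   # concentrates on the nearest prototype

print(small)
print(large)
```

With a small scaling factor every prototype receives nearly the same weight, while with a large one only the closest prototype contributes, which mirrors the two cases treated in the proof.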

Acknowledgements

The authors acknowledge the support of the Spanish project TIN2015-65464-R (MINECO/FEDER) and the 2016FI B 01163 grant of the Generalitat de Catalunya. The authors would like to thank Nicolas Chapados, Adam Salvail and Rachel Samson, as well as the anonymous reviewers, for their careful reading of the manuscript and for providing constructive feedback and valuable suggestions.

References

  • Bauer et al. (2017) M. Bauer, M. Rojas-Carulla, J. B. Świątkowski, B. Schölkopf, and R. E. Turner. Discriminative k-shot learning using probabilistic models. arXiv preprint arXiv:1706.00326, 2017.
  • Carey and Bartlett (1978) S. Carey and E. Bartlett. Acquiring a single new word. 1978.
  • Dumoulin et al. (2017) V. Dumoulin, J. Shlens, and M. Kudlur. A learned representation for artistic style. ICLR, 2017.
  • Edwards and Storkey (2016) H. Edwards and A. Storkey. Towards a neural statistician. arXiv preprint arXiv:1606.02185, 2016.
  • Fink (2005) M. Fink. Object classification from a single example utilizing class relevance metrics. In NIPS, pages 449–456, 2005.
  • Finn et al. (2017) C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, pages 1126–1135, 2017.
  • Hadsell et al. (2006) R. Hadsell, S. Chopra, and Y. LeCun. Dimensionality reduction by learning an invariant mapping. In CVPR, volume 2, pages 1735–1742. IEEE, 2006.
  • He et al. (2016) K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. CVPR, pages 770–778, 2016.
  • Hinton et al. (2015) G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. In NIPS Deep Learning and Representation Learning Workshop, 2015. URL http://arxiv.org/abs/1503.02531.
  • Koch et al. (2015) G. Koch, R. Zemel, and R. Salakhutdinov. Siamese neural networks for one-shot image recognition. In ICML Deep Learning Workshop, volume 2, 2015.
  • Krizhevsky (2009) A. Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
  • Lacoste et al. (2017) A. Lacoste, T. Boquet, N. Rostamzadeh, B. Oreshkin, W. Chung, and D. Krueger. Deep prior. arXiv preprint arXiv:1712.05016, 2017.
  • Lake et al. (2013) B. M. Lake, R. R. Salakhutdinov, and J. Tenenbaum. One-shot learning by inverting a compositional causal process. In NIPS, pages 2526–2534, 2013.
  • Lake et al. (2015) B. M. Lake, R. Salakhutdinov, and J. B. Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015.
  • Li et al. (2006) F.-F. Li, R. Fergus, and P. Perona. One-shot learning of object categories. PAMI, 28(4):594–611, 2006.
  • Mishra et al. (2018) N. Mishra, M. Rohaninejad, X. Chen, and P. Abbeel. A simple neural attentive meta-learner. In ICLR, 2018.
  • Munkhdalai et al. (2018) T. Munkhdalai, X. Yuan, S. Mehri, and A. Trischler. Rapid adaptation with conditionally shifted neurons. In ICML, 2018.
  • Perez et al. (2017) E. Perez, H. de Vries, F. Strub, V. Dumoulin, and A. C. Courville. Learning visual reasoning without strong priors. CoRR, abs/1707.03017, 2017.
  • Perez et al. (2018) E. Perez, F. Strub, H. De Vries, V. Dumoulin, and A. Courville. Film: Visual reasoning with a general conditioning layer. In AAAI, 2018.
  • Plank and Alonso (2017) B. Plank and H. M. Alonso. When is multitask learning effective? Semantic sequence prediction under varying data conditions. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2017, Valencia, Spain, pages 44–53, 2017.
  • Ramachandran et al. (2018) P. Ramachandran, B. Zoph, and Q. V. Le. Searching for activation functions. In ICLR, 2018.
  • Ravi and Larochelle (2016) S. Ravi and H. Larochelle. Optimization as a model for few-shot learning. In ICLR, 2016.
  • Ren et al. (2018) M. Ren, E. Triantafillou, S. Ravi, J. Snell, K. Swersky, J. B. Tenenbaum, H. Larochelle, and R. S. Zemel. Meta-learning for semi-supervised few-shot classification. arXiv preprint arXiv:1803.00676, 2018.
  • Santoro et al. (2016) A. Santoro, S. Bartunov, M. Botvinick, D. Wierstra, and T. Lillicrap. Meta-learning with memory-augmented neural networks. In M. F. Balcan and K. Q. Weinberger, editors, ICML, volume 48 of Proceedings of Machine Learning Research, pages 1842–1850, New York, New York, USA, 20–22 Jun 2016. PMLR.
  • Schmidhuber et al. (1997) J. Schmidhuber, J. Zhao, and M. Wiering. Shifting inductive bias with success-story algorithm, adaptive levin search, and incremental self-improvement. Machine Learning, 28(1):105–130, 1997.
  • Schroff et al. (2015) F. Schroff, D. Kalenichenko, and J. Philbin. Facenet: A unified embedding for face recognition and clustering. In CVPR, pages 815–823, 2015.
  • Shyam et al. (2017) P. Shyam, S. Gupta, and A. Dukkipati. Attentive recurrent comparators. In ICML, pages 3173–3181, 2017.
  • Snell et al. (2017) J. Snell, K. Swersky, and R. S. Zemel. Prototypical networks for few-shot learning. In NIPS, pages 4080–4090, 2017.
  • Sung et al. (2018) F. Sung, Y. Yang, L. Zhang, T. Xiang, P. H. Torr, and T. M. Hospedales. Learning to compare: Relation network for few-shot learning. In CVPR, 2018.
  • Taigman et al. (2015) Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. Web-scale training for face identification. In CVPR, pages 2746–2754, 2015.
  • Thrun (1998) S. Thrun. Lifelong learning algorithms. In Learning to learn, pages 181–209. Springer, 1998.
  • Vaswani et al. (2017) A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In NIPS, pages 6000–6010, 2017.
  • Vinyals et al. (2016) O. Vinyals, C. Blundell, T. Lillicrap, K. Kavukcuoglu, and D. Wierstra. Matching networks for one shot learning. In NIPS, pages 3630–3638, 2016.
  • Wang et al. (2018) Y.-X. Wang, R. Girshick, M. Hebert, and B. Hariharan. Low-Shot Learning from Imaginary Data. In CVPR, 2018.

Appendix S1 Architecture details

(a) Convolutional block with TEN.
(b) Resnet block with TEN.
Figure 1: Components of the ResNet-12 feature extractor.

ResNet-12 architecture details. The resnet blocks used in the ResNet-12 feature extractor are shown in Fig. 1. The feature extractor consists of 4 resnet blocks, shown in Fig. 1(b), followed by a global average-pool. Each resnet block consists of 3 convolutional blocks, shown in Fig. 1(a), followed by a 2x2 max-pool. Each convolutional layer is followed by a batch norm layer and the swish-1 activation function proposed by Ramachandran et al. [2018]. We found that the fully convolutional architecture performs best as a few-shot feature extractor, both on mini-Imagenet and on FC100. We found that inserting additional projection layers after the ResNet stack was always detrimental to the few-shot performance. We cross-validated this result with multiple hyper-parameter settings for the projection layers (number of layers, layer widths, and dropout). In addition, we observed that adding extra convolutional layers and max-pool layers before the ResNet stack was detrimental to the few-shot performance. Therefore, we used a fully convolutional, fully residual architecture in all our experiments.

The hyperparameters for the convolutional layers are as follows. The number of filters for the first ResNet block was set to 64 and it was doubled after each max-pool block. The $L_2$ regularizer weight was cross-validated at 0.0005 for each layer.
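
To make the filter schedule concrete, the following minimal sketch lists the number of filters and the spatial size after each of the 4 resnet blocks. The 84x84 input size, size-preserving convolutions, and a stride-2 max-pool that floors odd sizes are assumptions of this sketch, and `resnet12_layout` is a hypothetical helper name.

```python
def resnet12_layout(input_size=84, base_filters=64, num_blocks=4):
    """Return (filters, spatial size after the 2x2 max-pool) for each resnet block.

    Assumptions not stated in the text: an 84x84 input, convolutions that keep
    the spatial size, and a max-pool with stride 2 that floors odd sizes.
    """
    layout = []
    size, filters = input_size, base_filters
    for _ in range(num_blocks):
        size //= 2                  # the 2x2 max-pool closes each resnet block
        layout.append((filters, size))
        filters *= 2                # the filter count doubles after each max-pool
    return layout

layout = resnet12_layout()
print(layout)  # [(64, 42), (128, 21), (256, 10), (512, 5)]
```

Under these assumptions, the global average-pool that follows the last block would produce a 512-dimensional embedding.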

TEN architecture details. The detailed architecture of the TEN block is depicted in Fig. 2. Our implementation of the TEN uses two separate fully connected residual networks to generate the vectors $\gamma$ and $\beta$. We cross-validated the number of layers to be 3. The first layer projects the task representation into the target width. The target width is equal to the number of filters of the convolutional layer that the TEN block is conditioning (see Fig. 1(a)). The remaining layers operate at the target width and each of them has a skip connection. The regularizer weight for $\gamma_0$ and $\beta_0$ was cross-validated at 0.01 for each layer. We found that smaller values led to considerable overfitting. In addition, we were not able to successfully train TEN without $\gamma_0$ and $\beta_0$, because the training tended to get stuck in local minima where the overall effect of introducing TEN was detrimental to the few-shot performance of the architecture.

Figure 2: Architecture of the TEN block.

Appendix S2 Few-shot CIFAR100 details

Train split. Super-class labels: {1, 2, 3, 4, 5, 6, 9, 10, 15, 17, 18, 19}; super-class names: {fish, flowers, food_containers, fruit_and_vegetables, household_electrical_devices, household_furniture, large_man-made_outdoor_things, large_natural_outdoor_scenes, reptiles, trees, vehicles_1, vehicles_2}.

Validation split. Super-class labels: {8, 11, 13, 16}; super-class names: {large_carnivores, large_omnivores_and_herbivores, non-insect_invertebrates, small_mammals}.

Test split. Super-class labels: {0, 7, 12, 14}; super-class names: {aquatic_mammals, insects, medium_mammals, people}.

We would like to stress that we still sample all the tasks uniformly at random within train, validation and test subsets. Therefore, each task with very high probability contains samples belonging to classes from several superclasses.
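
The splits listed above can be written down as a small lookup table. The sketch below is illustrative only: the names `FC100_SPLITS` and `check_splits` are hypothetical, and the check simply confirms that the three splits are disjoint and together cover the 20 CIFAR100 super-classes.

```python
# Super-class indices of the FC100 splits, copied from the lists above.
FC100_SPLITS = {
    "train": [1, 2, 3, 4, 5, 6, 9, 10, 15, 17, 18, 19],
    "val": [8, 11, 13, 16],
    "test": [0, 7, 12, 14],
}

def check_splits(splits, num_superclasses=20):
    """Verify that the splits are pairwise disjoint and cover every super-class."""
    seen = set()
    for name, labels in splits.items():
        overlap = seen.intersection(labels)
        if overlap:
            raise ValueError(f"split {name!r} reuses super-classes {sorted(overlap)}")
        seen.update(labels)
    return seen == set(range(num_superclasses))

print(check_splits(FC100_SPLITS))  # True
```

Since each CIFAR100 super-class contains 5 classes, this split corresponds to 60 training, 20 validation and 20 test classes.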

Appendix S3 Training procedure details

Episode composition. The training procedure composes a few-shot training batch from several tasks, where a task is understood to be a fixed selection of 5 classes. We found empirically that for the 5-shot scenario the best number of tasks per batch was 2, for 10-shot it was 1 and for 1-shot it was 5. The sample set in each training batch was created using the same number of shots as in the target deployment (test) scenario. The images in the training query set were sampled uniformly at random. We observed that the best results were obtained when the number of query images was approximately equal to the total number of sample images in the batch. Thus we used 32 query images per task for 5-shot, 64 for 10-shot and 12 for 1-shot.
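
The batch composition described above can be sketched as follows. This is a hedged illustration rather than the authors' code: the data layout (`images_by_class` mapping a class label to image ids), the function name `sample_batch`, and the choice to draw query images only from images not already used in the sample set are all assumptions.

```python
import random

# Settings reported in the text: shots -> (tasks per batch, query images per task).
EPISODE_SETTINGS = {1: (5, 12), 5: (2, 32), 10: (1, 64)}

def sample_batch(images_by_class, shots, ways=5, rng=random):
    """Compose one training batch as a list of (sample_set, query_set) tasks.

    Drawing the query set from the images left over after the sample set is
    chosen is an assumption of this sketch.
    """
    tasks_per_batch, num_query = EPISODE_SETTINGS[shots]
    batch = []
    for _ in range(tasks_per_batch):
        classes = rng.sample(sorted(images_by_class), ways)  # one task = 5 fixed classes
        sample_set, pool = [], []
        for label in classes:
            images = images_by_class[label]
            shuffled = rng.sample(images, len(images))
            sample_set += [(img, label) for img in shuffled[:shots]]
            pool += [(img, label) for img in shuffled[shots:]]
        query_set = rng.sample(pool, num_query)  # uniform over the remaining images
        batch.append((sample_set, query_set))
    return batch

# Toy data: 10 classes with 30 image ids each.
toy = {c: [f"img_{c}_{i}" for i in range(30)] for c in range(10)}
batch = sample_batch(toy, shots=5, rng=random.Random(0))
print(len(batch), len(batch[0][0]), len(batch[0][1]))  # 2 25 32
```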

The auxiliary classification task is based on the usual 64-way training (for mini-Imagenet). Co-training uses a fixed batch of 64 image samples sampled uniformly at random from the training set. The learning rate annealing schedule for the auxiliary task is synchronized with that of the main few-shot task.

Optimization, scheduling and learning rate. When training with the auxiliary classification task, we used a total of 30000 episodes for training on mini-Imagenet and 10000 episodes for training on FC100. The results obtained with no auxiliary classification co-training used twice as many episodes. To obtain all our results we used SGD with momentum 0.9 and an initial learning rate of 0.1. The learning rate was annealed by a factor of 10 halfway through the training and two more times every 2500 episodes. The reported numbers are calculated using early stopping based on tracking the classification error on the validation set.
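
One possible reading of this annealing schedule is sketched below. The function name `learning_rate` and the exact episode at which each drop takes effect (inclusive boundaries, counted from the halfway point) are assumptions of this sketch.

```python
def learning_rate(episode, total_episodes, base_lr=0.1, factor=10.0,
                  extra_steps=2, step_every=2500):
    """Step schedule: divide by `factor` at the halfway point, then
    `extra_steps` more times, once every `step_every` episodes."""
    halfway = total_episodes // 2
    if episode < halfway:
        return base_lr
    drops = 1 + min(extra_steps, (episode - halfway) // step_every)
    return base_lr / factor ** drops

# Learning rate at a few episodes of a 30000-episode run.
for ep in (0, 14999, 15000, 17500, 20000, 29999):
    print(ep, learning_rate(ep, total_episodes=30000))
```

Under this reading, a 10000-episode run would reach its last drop only at the final episode, so the exact boundaries used by the authors may differ.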

Classification accuracy evaluation. The accuracy is evaluated using 10 random restarts of the optimization procedure and based on 500 randomly generated tasks each having 100 random query samples.

Reproducing results in [Snell et al., 2017]. To reproduce the results reported in [Snell et al., 2017] we used exactly the same setup and network architecture as reported in the original paper.