1 Introduction
Deep neural networks have recently achieved great success in many practical applications and have even surpassed humans in image recognition [8, 15, 28]. However, these successes rely heavily on tens of thousands of labeled examples, an enormous number of parameters and sophisticated training strategies. In low-data scenarios, such as medical image classification and marine biological recognition, deep neural networks overfit and their performance collapses because only a few labeled samples are available for training. To this end, few-shot learning, which explores the idea of learning new concepts rapidly and generalizing well from extremely few examples, or even a single one, has attracted significant research interest recently [5, 19, 24, 27, 29, 35, 38].
Concretely, we focus on few-shot classification with the meta-learning paradigm, which leverages a series of independent and identically distributed few-shot tasks to learn a desired classifier at training time and applies the model directly to non-overlapping unseen target classification problems at test time. The so-called episodic training strategy, which has been widely adopted in previous work [29, 35], keeps the training scenario consistent with the real test scenario and improves generalization performance. In each task (or episode), only a handful of labeled examples per class (the support set) are provided to the classifier for training, and then plenty of unlabeled points (the query set) must be assigned labels. Specifically, [29, 35, 38] combined meta-learning with metric learning in a feed-forward manner by optimizing a distance metric shared across all tasks. Moreover, [5] proposed to learn a general initialization strategy, and [17, 24] learned to update the parameters of the learner directly with a higher-level meta-optimizer (e.g. an LSTM).
While these approaches based on the meta-learning paradigm have made significant advances in few-shot classification, they suffer from two distinct limitations. The first is that all examples from different tasks are embedded indiscriminately into a task-independent metric space, see Fig. 1 (left). This design considers only example-level features and ignores task-level information (metadata), and thus neglects the specificity of different tasks. What is missing is an adaptive module that tailors the metric space to each task. The second is that most current methods follow the setup of inductive inference, that is, training the meta-learner with severely limited support data and predicting the queries one by one in each task. This process does not adequately consider the interaction between the support set and the unlabeled test set and thus weakens the advantages of meta-learning.
To address the key challenge of learning a generalizable classifier that can adapt to specific tasks with severely limited data, we propose a novel meta-learning framework for few-shot classification, named Transductive Episodic-wise Adaptive Metric (TEAM), which efficiently tailors an episodic-wise metric space by applying the idea of transductive inference. Specifically, we propose not only to learn a task-agnostic instance embedding model end-to-end over a pool of few-shot tasks in a meta-learning manner, but also to construct a task-specific distance metric explicitly for each task from its distinctive information, such as the pairwise constraints and a regularization prior, see Fig. 1 (right). Furthermore, we formulate the optimization of the task-specific metric as a standard semi-definite programming (SDP) problem [2] and obtain a closed-form solution, so that the problem is solved on the fly. All features generated by the task-agnostic model are then adapted into task-specific metric spaces where samples from the same class are closer together and samples from different classes are farther apart. On the transformed embeddings, we then perform a novel attention-based bi-directional similarity strategy to compute a more robust relationship between each unlabeled query and the class prototypes, which further improves the performance of our approach. In addition, by using convex combinations of samples within each task to construct auxiliary training tasks, we propose a task-level data augmentation technique that boosts the generalization of the embedding model. The whole framework is illustrated in detail in Fig. 2.
Our main contributions are threefold. (1) We propose a general meta-learning framework that applies transductive inference by formulating the adaptation procedure as an SDP problem and tailoring an episodic-wise metric for each task in few-shot learning; the framework can also be directly extended to other existing methods and even to semi-supervised learning. (2) We introduce a novel bi-directional similarity strategy for extracting a more robust relationship between queries and prototypes. (3) Experimental results on three benchmark datasets show that our framework is superior to other state-of-the-art approaches.
2 Related Work
Meta-learning in Few-shot Learning. Meta-learning, or learning to learn [31], is the technique of observing how different machine learning approaches perform on a wide range of tasks rather than on batches of data points, and then using this experience, or metadata, to learn new tasks much faster and obtain higher performance. In the recent few-shot learning literature, more and more approaches follow the idea of meta-learning to alleviate overfitting. Meta-LSTM [24] aims to learn efficient parameter-update rules for training a neural network (the learner) with an LSTM-based meta-learner. MAML [5], on the other hand, tries to learn a good model parameter initialization that generalizes better to similar tasks. Similarly, Reptile [21] is an approximation of MAML that executes stochastic gradient descent in a first-order form. Meta-SGD [17] goes further by learning the weight initialization, gradient update direction and learning rate within a single step. However, the approaches mentioned above suffer from the need for fine-tuning on target tasks. In contrast, after optimizing a meta-learner over a series of tasks, our approach solves unseen target tasks in a feed-forward manner without any further model updates and only optimizes an episodic-wise distance metric efficiently for each task, thus avoiding gradient computation and serious overfitting.
Distance Metric Learning Approaches. Another category of approaches focuses on obtaining a generalizable embedding model that transforms all samples into a common metric space, where simple classifiers such as nearest neighbor can be applied directly. Matching Network [35] integrates metric learning with meta-learning for the first time by training a learnable nearest neighbor classifier with deep neural networks. Prototypical Network [29] uses class prototype representations to assign labels to query points and formulates the final loss function directly with the Euclidean distance. This work was later extended to the semi-supervised few-shot scenario by [25], where unlabeled data are used to refine the class prototypes. Relation Network [38] trains an auxiliary network to compute the similarity score between each query and the support set, which is equivalent to further learning a non-linear metric. These approaches assume that all samples are embedded into a task-agnostic metric space. In contrast, our framework emphasizes the specificity of different tasks, that is, samples from different tasks should be embedded into a more discriminative task-specific space.
Task Adaptation Approaches. The third family of approaches, which aims to exploit task adaptability within the meta-learning paradigm, is currently an active research direction in few-shot learning. MT-Net [16] proposes that the meta-learner learns on each layer's activation space and the task-specific learner performs gradient descent on a subspace, which is more sensitive to task identity. TADAM [22] proposes a Task Encoding Network that produces scaling and shift vectors for each layer's weights, leading to a task-dependent metric space. Our approach is similar to these methods in that we all focus on task adaptation. However, without introducing an auxiliary network or a complicated training strategy, our framework formulates the adaptation procedure as a standard SDP problem under the transductive inference setting, which is more efficient and convenient.
Transductive Inference. Transductive inference [34] follows the setting of generalizing from the training set to the test set directly, avoiding the intermediate problem of estimating a function. In data-scarce scenarios, it can significantly improve performance over inductive methods. Reptile [21], which implicitly shares information between all test samples via batch normalization [11], is the first work to apply the transductive setting in few-shot learning. TPN [18] models transductive inference explicitly by learning a graph construction module, which propagates labels from labeled instances to unlabeled query points directly. Different from [18], which uses example-wise parameters for a static Euclidean distance, our approach explores an episodic-wise distance metric by integrating the small support set with the entire query set for transduction.
3 Transductive Episodic-wise Adaptive Metric
In this section, we define the few-shot learning (FSL) problem and then introduce the Transductive Episodic-wise Adaptive Metric (TEAM) in detail.
3.1 Problem Formulation
As in previous works [25, 29, 35, 38], we organize the learning procedure in the episodic paradigm, which gradually accumulates meta-knowledge across a pool of source tasks and adapts quickly to target tasks. Under this setting, the ultimate goal of our algorithm is to train models on a large labeled training set composed of a set of seen classes, and to apply the classifiers to a novel testing set with many unseen classes. Note that the testing set provides only a few labeled examples for each category. To mimic the few-shot test scenario during training and take full advantage of the large quantity of labeled training data, we reorganize all training examples into a series of few-shot tasks (or episodes), each with a fixed number of classes (ways) and a fixed number of labeled samples per class (shots). Concretely, such a task is usually constructed by first randomly selecting the classes from the training set and then generating a support set and a query set from the selected classes. The support set contains the given number of labeled samples per class, while the query set contains different samples from the same label space as the support set. In each episode, we train the learner on the small labeled support set and minimize the loss on the large query set. After training episode by episode until convergence, the learned model performs well on novel few-shot tasks.
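The episode construction described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper name `sample_episode`, the parameters `n_way`, `k_shot` and `n_query`, and the dictionary layout of the dataset are all introduced here for illustration.

```python
import random

def sample_episode(data_by_class, n_way, k_shot, n_query, rng=random):
    """Sample one few-shot episode (hypothetical helper, for illustration).

    data_by_class maps a class label to a list of samples.
    Returns (support, query), each a list of (sample, label) pairs.
    """
    # Pick the classes of this episode at random.
    classes = rng.sample(sorted(data_by_class), n_way)
    support, query = [], []
    for label in classes:
        # Draw disjoint support and query samples for the class.
        picked = rng.sample(data_by_class[label], k_shot + n_query)
        support += [(x, label) for x in picked[:k_shot]]
        query += [(x, label) for x in picked[k_shot:]]
    return support, query

# Toy usage: 10 classes with 20 integer "images" each (made-up data).
toy = {c: list(range(c * 100, c * 100 + 20)) for c in range(10)}
support, query = sample_episode(toy, n_way=5, k_shot=1, n_query=15,
                                rng=random.Random(0))
```

Sampling the support and query samples of a class in a single draw keeps the two sets disjoint, which matches the requirement that the query set contains different samples from the same label space.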
However, we argue that it is not ideal to apply the learned model to all target tasks directly without considering their specificity. Our approach introduces an episodic-wise metric construction module that transforms the task-agnostic embeddings into a task-specific metric space, a step we call episodic adaptation. Moreover, to mitigate the data scarcity of the support set and construct a more generalizable task-adaptive metric, we follow the paradigm of transductive inference and predict the query set as a whole instead of one query at a time.
Specifically, our few-shot classification framework (see Fig. 2) is composed of three modules: (1) learning a task-agnostic feature extractor that embeds raw inputs into a shared embedding space, which includes a loss function that drives the parameter updates and a novel task-level augmentation strategy that boosts generalization; (2) tailoring an episodic-wise adaptive metric for each task by solving a standard SDP problem efficiently; and (3) performing a novel bi-directional similarity strategy in the task-specific space to assign a label to each query. It is worth noting that the latter two modules operate in the transductive setting. We describe each module in detail in the following sections.
3.2 Task-agnostic Feature Extractor
Learning Embedding Function. Our approach first employs an embedding function, parameterized by a deep model, to extract the feature of each instance. Given the sequence of few-shot tasks drawn from the training set, we train the feature extractor episode by episode by minimizing the negative log-probability of the true label of each sample via SGD as:
(1) 
where each few-shot episode is sampled randomly from the task sequence, and the loss involves both the novel task-level data augmentation operation and the episodic-wise adaptive metric of that episode. After convergence, the optimal embedding model is applied directly to unseen target tasks, which are sampled from the testing set instead of the training set.
Task Internal Mixing Augmentation Strategy. Recently, data augmentation techniques such as flipping, rotating or distorting the inputs have been used to improve the generalization of deep neural networks, following the learning principle named Vicinal Risk Minimization (VRM) [3]. Inspired by [10, 39], we further propose a task-level data augmentation technique termed Task Internal Mixing (TIM), which performs convex combinations between the support samples in each task to synthesize new episodes. Concretely, for each instance in a source task, we randomly select another sample from the same task and synthesize a new training example as follows:
(2) 
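where the mixing coefficient lies between 0 and 1 and gives the larger weight to the original instance. Note that the two samples being combined are raw input tensors rather than features. We then process each instance a few times with Eq. (2) to form a virtual task (i.e. the augmented task in Eq. (1)) for training. In essence, TIM extends the source task distribution by incorporating the prior that if two samples are similar to each other in the original pixel space, then they are likely to be close in the feature space. As such, the synthesized example in Eq. (2) is more similar to the original instance than to the randomly selected sample, because the original instance receives the larger weight, so the synthetic label should be that of the original instance rather than that of the selected sample.
3.3 Episodic-wise Adaptive Metric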
Distance with Metric. Given two embeddings in a vector space, we denote the distance between them under a metric matrix as follows:
(3) 
where the equation uses the trace operator and the metric is a symmetric positive semi-definite matrix, which ensures that the distance satisfies the properties of a pseudo-distance [1]. In general, the matrix parameterizes a family of Mahalanobis distances in the vector space. In particular, the distance in Eq. (3) degenerates into the popular Euclidean distance if we set the metric to the identity matrix, that is, if we assume that all features are equally scaled and equally relevant [32]. Inspired by these observations, we propose to explicitly construct an episodic-wise adaptive metric by leveraging the specific information of each task, such as the pairwise constraints and the regularization prior.
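A minimal sketch of a Mahalanobis-type distance is given below, assuming the common form sqrt((x_i - x_j)^T M (x_i - x_j)) together with its equivalent trace form tr(M (x_i - x_j)(x_i - x_j)^T). Whether Eq. (3) uses exactly this form is an assumption; the function names and toy vectors are illustrative only.

```python
import math

def mahalanobis(x, y, M):
    """Distance sqrt((x - y)^T M (x - y)) for a PSD matrix M (nested lists)."""
    diff = [a - b for a, b in zip(x, y)]
    # Quadratic form diff^T M diff.
    quad = sum(diff[i] * M[i][j] * diff[j]
               for i in range(len(diff)) for j in range(len(diff)))
    return math.sqrt(quad)

def mahalanobis_trace(x, y, M):
    """Same distance via the trace identity tr(M (x - y)(x - y)^T)."""
    diff = [a - b for a, b in zip(x, y)]
    n = len(diff)
    outer = [[diff[i] * diff[j] for j in range(n)] for i in range(n)]
    # tr(M @ outer) = sum_i sum_k M[i][k] * outer[k][i]
    trace = sum(M[i][k] * outer[k][i] for i in range(n) for k in range(n))
    return math.sqrt(trace)

x, y = [1.0, 2.0], [4.0, 6.0]
identity = [[1.0, 0.0], [0.0, 1.0]]
scaled = [[4.0, 0.0], [0.0, 1.0]]
d_euclid = mahalanobis(x, y, identity)   # identity metric gives the Euclidean distance
d_scaled = mahalanobis(x, y, scaled)     # first feature weighted more heavily
```

With the identity metric the two toy points are at Euclidean distance 5, while the second metric stretches the first coordinate and therefore increases the distance.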
Pair-constrained Loss. Given a few-shot task, our goal is to minimize the mean distance between all similar sample pairs (must-link constraints) while keeping the mean distance between all dissimilar sample pairs (cannot-link constraints) larger than 1. Motivated by this idea, we formulate the Min-Max principle as a convex optimization problem in Eq. (4), adopting the squared distance for its effectiveness and efficiency.
(4) 
Based on Eq. (3) and the method of Lagrange multipliers, we rewrite Eq. (4) as a pair-constrained loss function:
(5) 
where the loss involves the Lagrange multiplier and two matrices of the following form:
(6) 
Furthermore, given a few-shot task with a support set and a query set, we first shrink the support set to a prototype set by:
(7) 
where each prototype is computed from the subset of the support set whose samples share the same label. Then we define the similar and dissimilar constraints using the prototypes, where each prototype carries the label of its class, a set of nearest neighbors of each prototype is taken from the query set, and an additional prototype set is drawn from the seen classes.
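The prototype and constraint construction can be sketched roughly as follows. Two assumptions are made here and should be read as such: prototypes are class means of the support embeddings (as in Prototypical Networks), and must-link pairs join each prototype to its k nearest queries while cannot-link pairs join prototypes of different classes. The source additionally mentions a prototype set from the seen classes, which this toy example omits. All function names are hypothetical.

```python
def prototypes(support):
    """Class prototype = mean embedding of the support samples of that class."""
    sums, counts = {}, {}
    for emb, label in support:
        acc = sums.setdefault(label, [0.0] * len(emb))
        for i, v in enumerate(emb):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {c: [v / counts[c] for v in s] for c, s in sums.items()}

def sq_dist(a, b):
    """Squared Euclidean distance between two embeddings."""
    return sum((u - v) ** 2 for u, v in zip(a, b))

def pair_constraints(protos, queries, k):
    """Assumed construction: must-link = (prototype, one of its k nearest
    queries); cannot-link = (prototype, prototype of another class)."""
    must, cannot = [], []
    labels = sorted(protos)
    for c in labels:
        nearest = sorted(queries, key=lambda q: sq_dist(protos[c], q))[:k]
        must += [(protos[c], q) for q in nearest]
        cannot += [(protos[c], protos[o]) for o in labels if o > c]
    return must, cannot

# Toy 2-D embeddings (made-up values).
support = [([0.0, 0.0], "a"), ([2.0, 0.0], "a"), ([10.0, 10.0], "b")]
queries = [[1.0, 0.5], [9.0, 9.0], [0.5, 0.0], [11.0, 10.0]]
protos = prototypes(support)
must, cannot = pair_constraints(protos, queries, k=2)
```

Using unlabeled queries as neighbors of the prototypes is what makes the constraint set transductive: the query set contributes to the metric without its labels being known.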
Regularization Loss. Without any restriction or prior information imposed on Eq. (5), the number of parameters to be optimized for each task is quadratic in the embedding dimension, because the metric is a full square matrix, while we can only construct a few pairwise constraints in the few-shot scenario. From the perspective of machine learning theory, this mismatch leads to severe overfitting. To this end, we propose a second principle that regularizes the episodic-wise metric to be close to a given prior metric, which encodes the prior shared across all few-shot tasks. Specifically, we minimize the Bregman divergence between the episodic-wise metric and the prior metric induced by the log-determinant function, which is strictly convex and continuously differentiable. We then formulate the regularization loss function as:
(8) 
where the trace operator is applied to a matrix and Eq. (8) ignores the constant term that depends only on the prior metric. More precisely, from the viewpoint of information theory, optimizing the regularization loss is equivalent to minimizing the KL divergence between two multivariate Gaussian distributions parameterized by the episodic-wise metric and the prior metric.
Episodic-wise Adaptive Metric. By integrating the two principles mentioned above, we formulate a novel episodic-wise adaptive metric (EAM) loss function for each task as:
(9) 
where the pair-constrained loss and the regularization loss are defined in Eq. (5) and Eq. (8) respectively, and they are balanced by a positive trade-off parameter. In general, minimizing Eq. (9) with SGD-based optimizers or other existing convex optimization solvers [30] produces a locally optimal solution for each task. However, because SDP solvers have high time complexity and SGD-based optimizers require too many iterations, this inner optimization within each task makes the learning process inefficient. Here, we propose a faster and more efficient approach to construct the episodic-wise metric for each task. Concretely, we first reformulate Eq. (9) as follows:
(10) 
Based on the Lemma (1), we get the optimal solution by letting and assuming ,
(11) 
This assumption can always hold if we pick a positive definite matrix as the prior metric. In addition to considering the pairwise constraints and the regularization prior, we further introduce feature correlation information into the final metric through the task covariance matrix, which results in:
(12) 
where the three coefficients are positive trade-off parameters and the two constraint matrices are from Eq. (6). Moreover, following the insight of transduction, we calculate the task covariance matrix from both the support set and the query set in each episode. Eq. (12) only involves simple matrix operations, such as inversion and transposition, and is therefore more efficient than SGD-based optimizers and naive SDP solvers. In addition, because the learned episodic-wise metric is a symmetric positive definite matrix, it can be factorized into the product of a transformation matrix and its transpose, so the distance in Eq. (3) can be formulated as the Euclidean distance between linearly projected embeddings. The metric can thus be viewed as an adaptive linear projection layer with a task-specific transformation matrix.
Lemma 1
Let be two symmetric positive-definite matrices of the same size, then the function is minimized uniquely by:
Proof. See supplementary material for more details.
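As an illustration of the kind of result that Lemma 1 relies on, the sketch below numerically checks a standard fact, assumed here to be of the same type as the lemma: for a symmetric positive-definite matrix A, the function f(X) = tr(AX) - log det X over positive-definite X is minimized at X = A^{-1}. The 2x2 helpers, the test matrix and the perturbations are illustrative choices, and the exact function used in the lemma may differ.

```python
import math

def inv2(A):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = A
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def det2(A):
    """Determinant of a 2x2 matrix."""
    return A[0][0] * A[1][1] - A[0][1] * A[1][0]

def trace_prod(A, B):
    """tr(A B) for 2x2 matrices."""
    return sum(A[i][k] * B[k][i] for i in range(2) for k in range(2))

def objective(X, A):
    """f(X) = tr(A X) - log det X, defined for positive-definite X."""
    return trace_prod(A, X) - math.log(det2(X))

A = [[2.0, 0.5], [0.5, 1.0]]          # symmetric positive definite
X_star = inv2(A)                      # candidate minimizer A^{-1}
f_star = objective(X_star, A)

# Small symmetric perturbations (X stays positive definite) should not do better.
perturbations = [[[0.05, 0.0], [0.0, 0.0]], [[0.0, 0.02], [0.02, 0.0]],
                 [[-0.03, 0.01], [0.01, 0.04]]]
worse = [objective([[X_star[i][j] + P[i][j] for j in range(2)]
                    for i in range(2)], A) > f_star for P in perturbations]
```

Setting the gradient A - X^{-1} to zero gives the same answer analytically, which is why a loss of this shape admits a solution that needs only one matrix inversion instead of an iterative solver.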
3.4 Bidirectional Similarity
Assuming all samples have been transformed into a task-specific embedding space with the learned feature extractor and the episodic-wise adaptive metric in Eq. (12), we then perform a novel bi-directional similarity strategy (BiSim) to calculate the probability that each query belongs to each category. In detail, after shrinking the support set to a prototype set with Eq. (7), we formulate the positive-direction similarity between a query and each prototype with the softmax function:
(13) 
Most previous methods use this similarity as the final probability of the query belonging to each category. However, taking the entire query set into account through transductive inference, we further compute the probability of each prototype belonging to each query with the following equation:
(14) 
We term this the negative-direction similarity, which can be interpreted as an attention-based weight of the prototype over the whole query set. Finally, we take the product of the positive-direction and negative-direction similarities as the final bi-directional similarity (BiSim) between the query and the prototype. The basic idea behind the BiSim strategy is that if a query is similar to a prototype and the prototype is also similar to the query, then they match each other more strongly. Without increasing the computational burden or requiring any human interaction, our strategy computes a more robust similarity efficiently.
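The bi-directional similarity can be sketched as the product of two softmax normalizations of the same distance table, one over prototypes and one over queries. The sketch assumes that both directions use a softmax over negative distances; the function names and the toy distance values are illustrative, not taken from the paper.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def bisim(dist):
    """dist[i][c] = distance between query i and prototype c.

    Returns sim[i][c] = P(c | query i) * P(query i | c), where the first
    softmax runs over prototypes and the second over queries.
    """
    n_q, n_c = len(dist), len(dist[0])
    # Positive direction: each query distributes probability over prototypes.
    pos = [softmax([-d for d in row]) for row in dist]
    # Negative direction: each prototype attends over the whole query set.
    cols = [softmax([-dist[i][c] for i in range(n_q)]) for c in range(n_c)]
    return [[pos[i][c] * cols[c][i] for c in range(n_c)] for i in range(n_q)]

dist = [[0.2, 2.0], [1.5, 0.3], [1.0, 1.1]]   # 3 queries, 2 prototypes
sim = bisim(dist)
pred = [max(range(2), key=lambda c: row[c]) for row in sim]
```

The negative direction needs the whole query set at once, which is why this step belongs to the transductive part of the framework.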
4 Experiments
In this section, we detail our experimental setting and compare TEAM with state-of-the-art approaches on three challenging datasets, i.e. miniImageNet [35], Cifar-100 [14] and CUB [36], which are widely used as few-shot classification benchmarks in the literature.
4.1 Datasets
miniImageNet. The miniImageNet dataset, originally proposed by [35], is the most popular benchmark in the few-shot learning community. It is composed of 100 classes randomly selected from ImageNet [15], and each class has 600 images, which are downsized for fast training and inference. We follow the setup provided by [24], which splits the 100 classes into 64, 16 and 20 classes for training, validation and evaluation respectively. The validation set is only used for tracking model generalization in all experiments.
Cifar-100. Cifar-100 [14] is a simple image classification dataset consisting of 100 categories, each with 600 RGB images. We split the whole dataset into 64 seen classes for training, and 16 and 20 classes for validation and testing respectively [40]. Compared with miniImageNet, Cifar-100 is a simpler dataset with lower inference complexity.
CUB. CUB [36], originally a benchmark dataset for fine-grained classification, is composed of 11788 images of 200 bird classes. Following the same partition as [9], we use 100 classes for training and two further sets of 50 classes each as unseen classes for validation and evaluation. All images are cropped with the provided bounding boxes [33].
4.2 Experimental Settings
Backbone Networks. For a fair and comprehensive comparison with previous baselines, we employ two backbone networks as our embedding function: 1) a four-layer convolutional network (ConvNet) and 2) a standard deep residual network (ResNet-18), both of which are widely adopted in the few-shot learning literature. Specifically, the ConvNet contains 4 repeated convolutional blocks, where each block is composed of a convolution layer with 64 filters, a batch normalization layer [11], a ReLU nonlinearity and a max-pooling layer. In addition, we empirically add a global average pooling layer at the end to accelerate convergence and reduce the dimensionality of the features. All inputs are resized to a uniform size, and the final output dimension is 256 for each image accordingly. For ResNet, we use the standard architecture proposed by [8] and remove the last fully connected layer to reduce the number of parameters. All inputs are resized to a fixed resolution, as in many previous works. The last average pooling layer then yields a 512-dimensional vector for each image.
Training Strategy. All backbone networks are optimized end-to-end with Adam [13] on a DGX-1. Following the strategy in [23, 26], we pre-train the ConvNet to classify all seen classes and use the resulting weights for model initialization, while we train the ResNet from scratch for simplicity. Moreover, we apply the TIM strategy in all experiments. Inspired by [10], we start the TIM strategy after 5000 episodes and intermittently disable it during training, that is, we perform task mixing for a fixed number of episodes and then turn it off for a following period of episodes. Both period lengths are set empirically. We halve the learning rate every 10000 episodes and set the patience of early stopping to 20000.
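The Task Internal Mixing step of Eq. (2) and the training schedule described above might be sketched as follows. Several details are assumptions made only for illustration: the mixing coefficient is drawn from a Beta(alpha, alpha) distribution and folded into [0.5, 1], the two period lengths of the intermittent schedule are placeholders, and the base learning rate is arbitrary. Only the 5000-episode warm-up and the halving every 10000 episodes come from the text.

```python
import random

def tim_mix(x_i, x_j, alpha=1.0, rng=random):
    """Convex combination of two raw inputs (flat lists of pixel values).

    Assumption: the coefficient is drawn from Beta(alpha, alpha) and folded
    to [0.5, 1] so that x_i dominates and its label can be kept.
    """
    lam = rng.betavariate(alpha, alpha)
    lam = max(lam, 1.0 - lam)
    return [lam * a + (1.0 - lam) * b for a, b in zip(x_i, x_j)], lam

def tim_enabled(episode, warmup=5000, on_period=1000, off_period=1000):
    """Intermittent schedule: off during warm-up, then alternately on and off.

    warmup=5000 comes from the text; both period lengths are placeholders.
    """
    if episode < warmup:
        return False
    return (episode - warmup) % (on_period + off_period) < on_period

def learning_rate(episode, base_lr=1e-3, step=10000):
    """Halve the learning rate every `step` episodes (base_lr is a placeholder)."""
    return base_lr * 0.5 ** (episode // step)

mixed, lam = tim_mix([0.0, 1.0], [1.0, 0.0], rng=random.Random(0))
```

Folding the coefficient so that it never drops below 0.5 is one simple way to guarantee that the synthesized input stays closer to the original instance, which is the property the label assignment in TIM depends on.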
Parameter Setup. In Eq. (9), we fix the trade-off parameters and set the prior metric to the identity matrix in all experiments. In general, the choice of the prior metric is not fixed and has an important influence on the generalization of the learned episodic-wise metric. However, based on the following two observations, we argue that the identity matrix is a natural choice for the prior metric. Firstly, starting from the Euclidean distance provides the most unbiased prior across all few-shot tasks, that is, it assumes that all features from the task-agnostic embedding space are equally scaled and equally relevant. Secondly, we observe that the optimal adaptive metric for each few-shot task is close to the identity matrix, as illustrated in Fig. 3. Please zoom in on Fig. 3 or refer to the appendix for more details.
4.3 Few-shot Learning Results
To verify the effectiveness of our approach for few-shot classification, we compare the proposed TEAM framework with our re-implemented baseline (ProtoNet [29]) and many state-of-the-art methods in various settings on three benchmark datasets (miniImageNet, Cifar-100 and CUB). For a fair comparison with previous works, we focus on two popular few-shot learning settings, namely 5-way 1-shot and 5-way 5-shot tasks, both of which contain 15 queries per episode for validation. In addition, we also experiment with the transductive setting on all datasets, where the model uses the entire query set in each task. Specifically, we consider two types of transduction in our experiments: 1) transductive batch normalization [5, 21], which shares information between all test examples via the batch normalization layers and is denoted by BN in all tables, and 2) explicit transduction, which was first introduced into few-shot learning by [18]. Moreover, to make the evaluation more convincing, we report the final mean accuracy over 1000 test trials for all experiments and present 95% confidence intervals of all results in our supplementary material. Please refer to our appendix for more complete results.
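The reporting protocol above (mean accuracy over many test episodes with a 95% confidence interval) can be sketched as follows. The normal approximation with the factor 1.96 is an assumption about how the intervals are computed, and the simulated accuracies are toy data, not results from the paper.

```python
import math
import random
import statistics

def mean_and_ci95(accuracies):
    """Mean accuracy and half-width of a normal-approximation 95% interval."""
    mean = statistics.fmean(accuracies)
    sem = statistics.stdev(accuracies) / math.sqrt(len(accuracies))
    return mean, 1.96 * sem

# Toy data: 1000 simulated per-episode accuracies (not results from the paper).
rng = random.Random(0)
trials = [rng.gauss(0.55, 0.10) for _ in range(1000)]
mean, half_width = mean_and_ci95(trials)
```

Because the half-width shrinks with the square root of the number of episodes, averaging over 1000 trials yields intervals well below one accuracy point for a per-episode spread of this size.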
Results on miniImageNet. Experimental results on miniImageNet are shown in Table 1, where we can see that our model achieves state-of-the-art performance with the ConvNet backbone and competitive results with the ResNet architecture. We re-implement ProtoNet as our baseline with the simple pre-training strategy proposed by [23] and achieve better performance than previously reported in [29]. Taking ConvNet as an example, we obtain 51.68% and 68.71% for 5-way 1-shot and 5-way 5-shot respectively, which are better than the 49.42% and 68.20% reported in [29]. Applying the TEAM framework to the baseline improves the performance significantly further. For example, the absolute improvement of TEAM over the published state of the art is 1.06% for 1-shot and 2.18% for 5-shot, and over our baseline it is 4.89% and 3.33% respectively. Note that the comparisons with PFA [23], LEO [26] and TADAM [22] are somewhat unfair, since we train TEAM (ResNet) without any pre-trained weights or an additional classification objective; nevertheless, our model still achieves the best performance on the 1-shot task.
Table 1: Few-shot classification accuracy (%) on miniImageNet. A hyphen marks a setting that is not reported.
Model  Tran.  5-way 1-shot ConvNet  5-way 1-shot ResNet  5-way 5-shot ConvNet  5-way 5-shot ResNet
MatchNet [35]  No  43.56  -  55.31  -
MAML [5]  BN  48.70  -  63.10  -
MAML+ [18]  Yes  50.83  -  66.19  -
Reptile [21]  BN  49.97  -  65.99  -
ProtoNet [29]  No  49.42  -  68.20  -
GNN [6]  No  50.33  -  64.02  -
RelationNet [38]  BN  50.44  -  65.32  -
PFA [23]  No  54.53  59.60  67.87  73.74
TADAM [22]  No  -  58.50  -  76.70
adaResNet [20]  No  -  56.88  -  71.94
LEO [26]  No  -  60.06  -  75.72
TPN [18]  Yes  55.51  59.46  69.86  75.65
Baseline (Ours)  No  51.68  55.25  68.71  70.58
TEAM (Ours)  Yes  56.57  60.07  72.04  75.90
Results on Cifar-100. Next we turn to the experiments on Cifar-100; all results are shown in Table 2. Note that all results of MatchNet [35], MAML [5] and DEML [40] in Table 2 are those reported in [40]. Compared with our baseline, whose accuracy is slightly higher than previously reported results, TEAM (ConvNet) improves by 6.24% on 1-shot tasks and 2.65% on 5-shot tasks, which demonstrates the effectiveness of our approach.
Results on CUB. The CUB dataset [36] was originally proposed for fine-grained recognition and is also widely used in the current few-shot classification literature. From Table 3, we observe that our baseline performs far better than the previous ProtoNet [29], because we preprocess all images with the provided bounding boxes [33] to reduce the impact of the background on the final performance. Comparing TEAM with our re-implemented baseline, both the ConvNet and ResNet backbones achieve clear improvements on 1-shot and 5-shot tasks.
Further Analysis. The results summarized in Tables 1-3 indicate that our approach consistently improves few-shot learning performance on different datasets. This confirms that, under the setting of transductive inference, our model can efficiently tailor an episodic-wise adaptive metric for each task and compute a suitable similarity between all samples. Furthermore, we notice that the improvement of our approach in the 1-shot scenario is more significant than in the 5-shot scenario. This observation agrees with the nature of transduction [12, 18]: the more training data are available, the smaller the performance improvement. To examine this, we further perform 5-way k-shot (k = 1, 3, 5, 7, 9) experiments on miniImageNet, and all results are shown in Table 6. As the number of shots increases, TEAM consistently outperforms our baseline by a large margin, but the improvement from TEAM decreases slightly, which further supports the above analysis of transductive inference.
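Table 2: Few-shot classification accuracy (%) on Cifar-100. A hyphen marks a setting that is not reported.
Model  Tran.  5-way 1-shot ConvNet  5-way 1-shot ResNet  5-way 5-shot ConvNet  5-way 5-shot ResNet
MatchNet [35]  No  50.53  -  60.30  -
MAML [5]  BN  49.28  -  58.30  -
ProtoNet [29]  No  56.66  -  76.29  -
DEML [7]  No  -  61.62  -  77.94
Baseline (Ours)  No  57.83  66.30  76.40  80.46
TEAM (Ours)  Yes  64.07  70.43  79.05  81.25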
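Table 3: Few-shot classification accuracy (%) on CUB. A hyphen marks a setting that is not reported.
Model  Tran.  5-way 1-shot ConvNet  5-way 1-shot ResNet  5-way 5-shot ConvNet  5-way 5-shot ResNet
MatchNet [35]  No  56.53  -  63.54  -
MAML [5]  BN  50.45  -  59.60  -
ProtoNet [29]  No  58.43  -  75.22  -
RelationNet [38]  BN  62.45  -  76.11  -
DEML [7]  No  -  66.95  -  77.11
TriNet [4]  No  -  69.61  -  84.10
Baseline (Ours)  No  69.39  74.55  82.78  85.98
TEAM (Ours)  Yes  75.71  80.16  86.04  87.17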
4.4 Ablation Study
Effectiveness of Different Modules. According to the previous analysis, the proposed TEAM framework is far superior to our baseline and becomes a new state-of-the-art approach in the few-shot classification literature. As a necessary step of the ablation study, we first analyze how much each module (TIM, EAM and BiSim) contributes to the final performance. All results are shown in detail in Table 4. Note that our baseline (ProtoNet) uses the ConvNet as its backbone network and achieves higher performance on all three datasets (see the second row in Table 4) than previously reported because of the pre-trained weight initialization. Furthermore, we evaluate three variants of the TEAM framework: 1) TEAM with TIM only, which adds the TIM strategy to our baseline; 2) TEAM with TIM and the episodic-wise adaptive metric (EAM) together; and 3) the full TEAM, which combines all three modules to obtain the final performance. Comparing the second and third rows of Table 4, we observe that the TIM strategy consistently improves the performance on all few-shot tasks. We then compare the first two variants, whose only difference is whether the EAM module is used. Taking the 1-shot task on miniImageNet as an example, the variant with EAM is 2.38% higher than the variant with TIM only, which demonstrates that it is feasible to construct an episodic-wise adaptive metric for each task in few-shot learning. The last row of Table 4 further shows the effectiveness of our entire framework.
Model  miniImageNet 1-shot  miniImageNet 5-shot  Cifar-100 1-shot  Cifar-100 5-shot  CUB 1-shot  CUB 5-shot
Proto [29]  49.42  68.20  56.66  76.29  58.43  75.22
Proto (Ours)  51.68  68.71  57.83  76.40  69.39  82.78
TEAM (TIM)  52.97  70.45  59.56  77.65  70.27  84.68
TEAM (TIM + EAM)  55.35  71.59  62.76  78.80  75.06  86.06
TEAM (TIM + EAM + BiSim)  56.57  72.04  64.07  79.05  75.71  86.04
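The per-module contributions discussed in the ablation can be recomputed directly from Table 4. The following is a minimal sketch, not part of the original experiments: the accuracies are copied from the table, while the row labels (`+ TIM`, `+ TIM + EAM`, `+ TIM + EAM + BiSim`) and the helper name `incremental_gains` are illustrative assumptions that follow the order of the settings described in the text.

```python
# Accuracies (%) copied from Table 4, in row order.
# Row labels are descriptive names assumed for this sketch; they follow the
# order of the three TEAM settings described in the ablation text.
COLUMNS = [
    "miniImageNet 1-shot", "miniImageNet 5-shot",
    "Cifar-100 1-shot", "Cifar-100 5-shot",
    "CUB 1-shot", "CUB 5-shot",
]
ROWS = [
    ("Proto (Ours)", [51.68, 68.71, 57.83, 76.40, 69.39, 82.78]),
    ("+ TIM", [52.97, 70.45, 59.56, 77.65, 70.27, 84.68]),
    ("+ TIM + EAM", [55.35, 71.59, 62.76, 78.80, 75.06, 86.06]),
    ("+ TIM + EAM + BiSim", [56.57, 72.04, 64.07, 79.05, 75.71, 86.04]),
]


def incremental_gains(rows):
    """For each row after the first, return its per-column gain over the previous row."""
    gains = []
    for (_, prev), (name, cur) in zip(rows, rows[1:]):
        gains.append((name, [round(c - p, 2) for p, c in zip(prev, cur)]))
    return gains


for name, deltas in incremental_gains(ROWS):
    # Print the accuracy change contributed by the module added in this row.
    print(name, dict(zip(COLUMNS, deltas)))
```

The second entry returned by `incremental_gains(ROWS)` isolates the effect of EAM; its first value is the 2.38-point gain on the miniImageNet 1-shot task quoted in the text.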
Methods  5-way 1-shot  5-way 5-shot
Soft k-Means [25]  
Soft k-Means+Cluster [25]  
Masked Soft k-Means [25]  
TPN-semi [18]  
TEAM-semi (Ours)  54.81 ± 0.59  68.92 ± 0.38
Comparison with Semi-supervised Few-shot Learning. From the perspective of unlabeled data, transductive inference is a special case of semi-supervised learning: the former directly uses the test set as unlabeled data, whereas the latter uses additional auxiliary unlabeled data. We therefore propose a semi-supervised version of the TEAM framework, TEAM-semi, and compare it with other semi-supervised few-shot approaches. Specifically, following the labeled/unlabeled data split in [25], we use 40% of the examples in each class as labeled data and the remaining 60% as unlabeled data. For a fair comparison, the support and query examples in each task are randomly sampled from the labeled set only. All results, averaged over 10 random labeled/unlabeled partitions of the training set, are reported in Table 5. Compared with the previous state-of-the-art approach TPN-semi [18], our TEAM-semi framework improves accuracy by 2.03% and 2.50% on the 1-shot and 5-shot tasks respectively, which verifies its ability to handle both supervised and semi-supervised few-shot classification.
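The per-class 40%/60% split and the restriction of support and query sampling to the labeled pool can be sketched as follows. This is only an illustration under stated assumptions: the function names, the dictionary-based data layout, the seeds and the default of 15 queries per class are hypothetical choices for the sketch and are not taken from the protocol of [25].

```python
import random


def split_labeled_unlabeled(examples_by_class, labeled_fraction=0.4, seed=0):
    """Split each class's examples into a labeled pool and an unlabeled pool."""
    rng = random.Random(seed)
    labeled, unlabeled = {}, {}
    for cls, examples in examples_by_class.items():
        shuffled = list(examples)
        rng.shuffle(shuffled)
        n_labeled = int(round(labeled_fraction * len(shuffled)))
        labeled[cls] = shuffled[:n_labeled]
        unlabeled[cls] = shuffled[n_labeled:]
    return labeled, unlabeled


def sample_episode(labeled, n_way=5, n_shot=1, n_query=15, seed=0):
    """Sample support and query examples from the labeled pool only.

    n_query=15 is an assumed default for this sketch.
    """
    rng = random.Random(seed)
    classes = rng.sample(sorted(labeled), n_way)
    support, query = {}, {}
    for cls in classes:
        picked = rng.sample(labeled[cls], n_shot + n_query)
        support[cls] = picked[:n_shot]
        query[cls] = picked[n_shot:]
    return support, query


# Toy data: 10 classes with 50 example identifiers each (made-up values).
toy = {c: ["img_%d_%d" % (c, i) for i in range(50)] for c in range(10)}
labeled, unlabeled = split_labeled_unlabeled(toy)
support, query = sample_episode(labeled)
```

Averaging over several partitions, as done for Table 5, would correspond to repeating the split with different `seed` values.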
Sparsity Nature of Episodic-wise Adaptive Metric. In this section we explore the sparsity of the episodic-wise adaptive metric in few-shot learning. Taking a 5-way 5-shot task as an example, we set 15 queries per class and apply the classic LMNN algorithm [37] to all support and query samples to optimize an oracle metric, which ensures that all examples in this task can be completely distinguished. We then scale all elements of the metric into the range [0, 1] and visualize it as a heatmap in Fig. 3 (left). We observe that the diagonal elements always take larger values (close to red) than the off-diagonal elements (close to blue). After sorting all values in descending order in Fig. 3 (right), we further notice a large gap between the diagonal and off-diagonal elements. These observations indicate that, in the low-data setup, there is not enough prior information to find accurate correlations between all dimensions, except for the strong self-correlation on the diagonal, which leads to the sparsity of the episodic-wise adaptive metric. From this practical viewpoint, we further verify that it is reasonable to set the identity matrix as the prior metric in Eq. (12).
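The scaling and sorting steps of this analysis can be sketched in a few lines. The example below is a hedged illustration only: it does not run LMNN, the 3x3 matrix is made-up toy data standing in for a learned oracle metric, and the helper names `min_max_scale`, `sorted_descending` and `diagonal_gap` are assumptions introduced for the sketch.

```python
def min_max_scale(matrix):
    """Linearly rescale all entries of a square matrix into [0, 1]."""
    values = [v for row in matrix for v in row]
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # Constant matrix: map every entry to 0 to avoid dividing by zero.
        return [[0.0 for _ in row] for row in matrix]
    return [[(v - lo) / span for v in row] for row in matrix]


def sorted_descending(matrix):
    """Return all scaled entries in descending order, tagged as diagonal or not."""
    scaled = min_max_scale(matrix)
    n = len(scaled)
    entries = [(scaled[i][j], i == j) for i in range(n) for j in range(n)]
    return sorted(entries, key=lambda e: e[0], reverse=True)


def diagonal_gap(matrix):
    """Smallest scaled diagonal entry minus largest scaled off-diagonal entry."""
    scaled = min_max_scale(matrix)
    n = len(scaled)
    diag = [scaled[i][i] for i in range(n)]
    off = [scaled[i][j] for i in range(n) for j in range(n) if i != j]
    return min(diag) - max(off)


# Toy symmetric matrix with a dominant diagonal (made-up values, not a learned metric).
toy_metric = [
    [2.0, 0.1, -0.2],
    [0.1, 1.5, 0.0],
    [-0.2, 0.0, 1.8],
]
print(diagonal_gap(toy_metric) > 0)
```

A positive `diagonal_gap` means every diagonal entry exceeds every off-diagonal entry after scaling, which is the pattern described for Fig. 3 (right).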
Methods  1-shot  3-shot  5-shot  7-shot  9-shot
Baseline (Ours)  51.68  63.87  68.71  71.28  73.35
TEAM (Ours)  56.57  67.64  72.04  73.47  75.04
Accuracy (+)  4.89  3.77  3.33  2.19  1.69
5 Conclusions
We have proposed Transductive Episodic-wise Adaptive Metric (TEAM) for few-shot learning, a simple and efficient framework based on meta-learning. It not only learns a shared embedding model across all tasks end-to-end but also tailors an episodic-wise metric by taking the distinctive information within each task into account. Moreover, by using the entire query set at once for inference, we leverage a bidirectional similarity strategy to extract more robust relationships between queries and prototypes. TEAM achieves state-of-the-art performance on three few-shot benchmark datasets and is easily extended to a semi-supervised version. Extending TEAM to other few-shot approaches is left as future work.
Acknowledgement.
This work is partially supported by grants from the National Key R&D Program of China under grant 2017YFB1002400 and the National Natural Science Foundation of China under contracts No. U1611461, No. 61825101 and No. 61672072. It is also supported by grants from NVIDIA and the NVIDIA DGX-1 AI Supercomputer, and by the Beijing Nova Program (Z181100006218063).
References
 [1] Aurélien Bellet, Amaury Habrard, and Marc Sebban. A survey on metric learning for feature vectors and structured data. arXiv:1306.6709, 2013.
 [2] Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge university press, 2004.
 [3] Olivier Chapelle, Jason Weston, Léon Bottou, and Vladimir Vapnik. Vicinal risk minimization. In NIPS, pages 416–422, 2001.
 [4] Zitian Chen, Yanwei Fu, Yinda Zhang, Yu-Gang Jiang, Xiangyang Xue, and Leonid Sigal. Semantic feature augmentation in few-shot learning. arXiv:1804.05298, 2018.
 [5] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, 2017.
 [6] Victor Garcia and Joan Bruna. Few-shot learning with graph neural networks. In ICLR, 2017.
 [7] Bharath Hariharan and Ross B Girshick. Low-shot visual recognition by shrinking and hallucinating features. In ICCV, pages 3037–3046, 2017.
 [8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
 [9] Nathan Hilliard, Lawrence Phillips, Scott Howland, Artëm Yankov, Courtney D Corley, and Nathan O Hodas. Few-shot learning with metric-agnostic conditional embeddings. arXiv:1802.04376, 2018.
 [10] Hiroshi Inoue. Data augmentation by pairing samples for images classification. arXiv:1801.02929, 2018.
 [11] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
 [12] Thorsten Joachims. Transductive inference for text classification using support vector machines. In ICML, volume 99, pages 200–209, 1999.
 [13] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
 [14] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
 [15] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, pages 1097–1105, 2012.
 [16] Yoonho Lee and Seungjin Choi. Gradient-based meta-learning with learned layerwise metric and subspace. In ICML, 2018.
 [17] Zhenguo Li, Fengwei Zhou, Fei Chen, and Hang Li. Meta-SGD: Learning to learn quickly for few-shot learning. arXiv:1707.09835, 2017.
 [18] Yanbin Liu, Juho Lee, Minseop Park, Saehoon Kim, Eunho Yang, Sung Ju Hwang, and Yi Yang. Learning to propagate labels: Transductive propagation network for few-shot learning. In ICLR, 2018.
 [19] Tsendsuren Munkhdalai and Hong Yu. Meta networks. In ICML, 2017.
 [20] Tsendsuren Munkhdalai, Xingdi Yuan, Soroush Mehri, and Adam Trischler. Rapid adaptation with conditionally shifted neurons. In ICML, pages 3661–3670, 2018.
 [21] Alex Nichol, Joshua Achiam, and John Schulman. On first-order meta-learning algorithms. CoRR, abs/1803.02999, 2, 2018.
 [22] Boris Oreshkin, Pau Rodríguez López, and Alexandre Lacoste. Tadam: Task dependent adaptive metric for improved few-shot learning. In NIPS, pages 719–729, 2018.
 [23] Siyuan Qiao, Chenxi Liu, Wei Shen, and Alan Yuille. Few-shot image recognition by predicting parameters from activations. In CVPR, volume 2, 2017.
 [24] Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. In ICLR, 2016.
 [25] Mengye Ren, Eleni Triantafillou, Sachin Ravi, Jake Snell, Kevin Swersky, Joshua B Tenenbaum, Hugo Larochelle, and Richard S Zemel. Meta-learning for semi-supervised few-shot classification. In ICLR, 2018.
 [26] Andrei A Rusu, Dushyant Rao, Jakub Sygnowski, Oriol Vinyals, Razvan Pascanu, Simon Osindero, and Raia Hadsell. Meta-learning with latent embedding optimization. In ICLR, 2019.
 [27] Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy Lillicrap. Meta-learning with memory-augmented neural networks. In ICML, pages 1842–1850, 2016.
 [28] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
 [29] Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. In NIPS, pages 4077–4087, 2017.
 [30] Jos F Sturm. Using SeDuMi 1.02, a MATLAB toolbox for optimization over symmetric cones. Optimization Methods and Software, 11(1–4):625–653, 1999.
 [31] Sebastian Thrun and Lorien Pratt. Learning to learn. Springer Science & Business Media, 2012.
 [32] Lorenzo Torresani and Kuangchih Lee. Large margin component analysis. In NIPS, pages 1385–1392, 2007.
 [33] Eleni Triantafillou, Richard Zemel, and Raquel Urtasun. Few-shot learning through an information retrieval lens. In NIPS, pages 2255–2265, 2017.
 [34] Vladimir N Vapnik. An overview of statistical learning theory. IEEE Transactions on Neural Networks, 10(5):988–999, 1999.
 [35] Oriol Vinyals, Charles Blundell, Tim Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. In NIPS, pages 3630–3638, 2016.
 [36] Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie. The Caltech-UCSD Birds-200-2011 dataset. Technical Report CNS-TR-2011-001, 2011.
 [37] Kilian Q Weinberger, John Blitzer, and Lawrence K Saul. Distance metric learning for large margin nearest neighbor classification. In NIPS, pages 1473–1480, 2006.
 [38] Flood Sung, Yongxin Yang, Li Zhang, Tao Xiang, Philip HS Torr, and Timothy M Hospedales. Learning to compare: Relation network for few-shot learning. In CVPR, 2018.
 [39] Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. In ICLR, 2018.
 [40] Fengwei Zhou, Bin Wu, and Zhenguo Li. Deep meta-learning: Learning to learn in the concept space. arXiv:1802.03596, 2018.