A Two-Stage Approach to Few-Shot Learning for Image Recognition

12/10/2019
by   Debasmit Das, et al.
Purdue University

This paper proposes a multi-layer neural network structure for few-shot image recognition of novel categories. The proposed architecture encodes transferable knowledge extracted from a large annotated dataset of base categories and then applies it to novel categories containing only a few samples. The transfer of knowledge is carried out at the feature-extraction and classification levels, distributed across two training stages. In the first training stage, we introduce the relative feature to capture the structure of the data and to obtain a low-dimensional discriminative space. We also account for the variable variance of different categories by using a network to predict the variance of each class. Classification is then performed by computing the Mahalanobis distance to the mean-class representation, in contrast to previous approaches that used the Euclidean distance. In the second training stage, a category-agnostic mapping is learned from the mean-sample representation to its corresponding class-prototype representation, because the mean-sample representation may not accurately represent the novel-category prototype. Finally, we evaluate the proposed network structure on four standard few-shot image recognition datasets, where our few-shot learning system produces competitive performance compared to previous work. We also extensively study and analyze the contribution of each component of the proposed framework.


I Introduction

For the past decade, deep convolutional neural networks (CNNs) have produced excellent results in visual recognition tasks such as object recognition and scene classification [9, 28, 66]. A CNN learns to recognize a large number of visual categories by training on a large collection of annotated images using a gradient-descent technique [32]. Although the training procedure is computationally intensive, it can be parallelized using a Graphics Processing Unit (GPU). Even after a long training period, the CNN can only recognize a fixed set of image categories. To learn to recognize novel categories, one has to collect new training data and re-train the CNN model with further adjustments. Unfortunately, in some cases, there might not be enough labeled data available for training a novel category. This results in a long-tailed distribution of object categories [67], as shown in Fig. 1. In such a long-tailed distribution, only a few object categories occur frequently, so we can obtain many samples from them; however, there are many categories that occur very rarely, from which we can obtain only a few samples. As an example, a crow is a bird that we see very often, so we can collect many crow samples with sufficient variability. On the other hand, samples of a rare bird such as the kakapo are very difficult to obtain.

Fig. 1: Object categories follow a long tailed distribution with a lot of rare classes and very few common classes.

Research on learning novel categories from a few samples is termed few-shot learning. Most previous methods tackle few-shot learning by assuming access to a large labeled training database as base categories. Using this large database, the goal of few-shot image recognition systems is to recognize any novel category accurately from just a few samples of that category.

Traditional supervised learning with only a few training samples often causes overfitting and results in poor generalization, for the following reasons. Firstly, there is the fundamental problem known as the curse of dimensionality: the sparsity of the feature volume caused by having few samples in a high-dimensional image feature space aggravates overfitting. Secondly, a few training samples cannot represent the overall variation of a class. Hence, the true spread of the class distribution remains unknown and the classification boundaries are poorly estimated. Also, the few training samples of a class might be drawn from near the edge of the class distribution. As a result, the mean of these training samples would not be close to the true mean of the class, so the sample mean would not accurately represent the location of the class in the feature space, resulting in mis-classification.

In this paper, we propose solutions to each of the above problems. Firstly, to address the problem of high dimensionality, we propose a low-dimensional discriminative space called the relative-feature space. In this space, the relative feature of a sample is represented as a vector of distances to the training samples in a training batch. Since the number of training samples is small, the dimensionality of this relative-feature space is much lower than that of the original absolute-feature space. The features are also discriminative, since instances from the same class are expected to cluster and would have similar pairwise inter-class and intra-class distances. An additional benefit of the relative features is that they extract second-order structural information about the dataset to assist recognition. Using higher-order features beyond the second-order relative features would forfeit the benefit of a low-dimensional feature space. Therefore, the combination of relative features and absolute features yields better recognition performance. Secondly, to address the uncertain variance of categories, we propose a trainable neural network (NN) module to predict the variance of each category. Finally, we propose to learn a category-agnostic transformation from the class-mean representation to the class-prototype representation. As a result, a more accurate location of a class can be obtained from the mean of a few samples.

The contributions of this paper are at both the feature-extraction and the classification stages of the few-shot object recognition system. They can be summarized as follows: (a) a novel relative-feature descriptor used in combination with the original absolute deep-feature descriptor for object recognition; (b) a framework for learning class variances in order to compute Mahalanobis distances to class prototypes; (c) an additional training pipeline for learning a category-agnostic transformation from the class-mean representation to the class prototype. The two stages are not trained jointly, since the category-agnostic transformation assumes that a robust representation has already been learned for the images. Finally, we have conducted extensive experiments and analyses on four standard datasets to verify the validity of the proposed two-stage few-shot learning framework for image recognition.

This paper is organized into five sections. Section II discusses related work and Section III describes our proposed approach. Section IV provides experimental results and discussion. This is followed by conclusions and future directions in Section V.

II Related Work

The field of few-shot learning has attracted increasing interest in the past decade. Most of the earlier methods used a Bayesian approach of introducing priors to facilitate few-shot learning. Li et al. [13] used a global prior while Salakhutdinov et al. [50] used a super-category-level prior. For application-specific tasks like handwriting recognition, generative models have been proposed that can produce characters from parts [64] or strokes [30]. For object recognition, a hierarchical Bayesian program has been proposed to utilize compositional and causal approaches to create a probabilistic generative model for visual objects [31, 29]. Some ad-hoc approaches to few-shot learning carried out data augmentation by harnessing unlabeled data [4], by transformation and adding noise [5, 11], by synthesizing artificial examples [18, 19, 61, 35], or by using compositional representations [68, 10]. More recent methods that used generative modeling include the auto-encoder [52] and variations of adversarial-network-based architectures [65, 15]. However, most of these generative methods require a lot of effort to generate data; otherwise the generated data do not represent the actual data distribution properly. Thus, recent methods mostly take a metric-learning or a meta-learning approach to few-shot learning.

Metric-learning approaches strive to preserve the class neighborhood structure; that is, the representations are learned such that features from the same class are clustered together while features from different classes are kept far apart. As a result, novel-class features are expected to have more room for classification error. Koch et al. [26] used Siamese Networks to match a training example of a novel category to a test example, with training carried out on an object recognition dataset. Vinyals et al. [59] proposed Matching Networks, which used a nearest-neighbor classifier in addition to an attention mechanism over the training samples. Prototypical Networks [55] extended nearest-mean classifiers [36] and learned to classify query samples by computing Euclidean distances to prototype features. As an extension to Prototypical Networks, Sung et al. [57] learned a distance metric instead of using a predefined distance function. A more recent method [43] used a metric-learning approach where the metric is scaled and adapted based on the task.

On the other hand, meta-learning methods for few-shot learning use a learning-to-learn scheme, where a model extracts useful transferable knowledge about the learning procedure from a large collection of tasks. This helps in quickly learning a novel task which, in our case, is image recognition for novel categories. Ravi and Larochelle [48] used Long Short-Term Memory (LSTM) [22] to train a meta-learner to produce model-parameter updates for optimizing a base learner on a task; this method essentially learns the optimization procedure using data from a number of auxiliary tasks. The learning-to-learn [1] approach to few-shot learning is also closely related to this learning-to-optimize technique. Finn et al. [14] built upon this work to focus on learning the initial parameters for gradient descent so that the learner can be optimized for a new task in a few iterations. Mishra et al. [38] introduced temporal convolutions to predict the label of a test example, given a sequence of labeled samples and the unlabeled test sample. The transductive propagation network [33] classifies the whole test dataset using a graph-based label-propagation mechanism, with an end-to-end meta-learning framework that learns the feature embedding and the graph construction simultaneously. Sun et al. [56] used a meta-transfer learning mechanism that shifts and scales neural network weights for new tasks. Similarly, Munkhdalai et al. [40] proposed a meta-learning scheme that shifts the neuron activations depending on task-specific parameters.

Alternative few-shot learning methods include memory-based models [51, 39, 47] that store selective relevant information and use it for comparison at test time. Attentive comparators [54] compare patches of images sequentially through an attention mechanism and then arrive at a prediction. Qiao et al. [46] learned a category-agnostic mapping from activations to parameters that allowed fast generalization to novel categories. A similar idea [45] was used to imprint weights for the classification layer of the novel categories. Bertinetto et al. [2] used a differentiable closed-form solver based on ridge regression for fast adaptation to novel categories. Some methods extended existing machine-learning concepts like graph neural networks [16] and information retrieval [58] to few-shot learning. For a more comprehensive survey on few-shot learning, one can refer to [53, 60].

Fig. 2: Overall framework of the proposed approach for a 3-way 1-shot inference scenario. A single image from each of the 3 classes (shown in different colors) is used as a support example, along with a single query image. The output is the probability of the query example belonging to each of the 3 classes.

III Proposed Approach

III-A Problem Definition and Formulation

Our proposed few-shot learning method has both metric-learning and meta-learning components, which are learned in two stages. The metric-learning stage learns both absolute and relative feature sets and then uses the Mahalanobis distance metric to compute class labels of the test sample. The idea of using relative features stemmed from our prior work in domain adaptation [7, 8, 6]. Domain adaptation considers adaptation between labeled source-domain data and unlabeled target-domain data, but with the same categories in both domains. The meta-learning stage learns auxiliary knowledge for classification, namely a transformation from a sample to its corresponding class prototype. This idea is related to the work of Wang et al. [62], who learned to transform small-sample model parameters to large-sample model parameters. The work on few-shot learning without forgetting [17] also used a category-agnostic transformation, but with a different distance metric and without any procedure to avoid negative transfer. The overall framework of our proposed few-shot learning approach is shown in Fig. 2.

Our proposed few-shot learning image recognition system is trained using a large database of base categories, which consists of a large number of samples from each category. Each of these categories contains a large amount of data that we can use to learn some useful generalizable knowledge. This knowledge should help the recognition of novel categories for which only a few labeled samples per category are available.

The knowledge can be learned using traditional supervised learning, where training is generally carried out by feeding instances from the base categories in the form of mini-batches and then optimizing some loss function. The model is generally tested on the same set of categories on which it is trained. If we want the trained model to work on novel categories, the model can be fine-tuned on the new training dataset [42]. However, fine-tuning might not work if the novel categories have very few samples each. In fact, the fine-tuning procedure might cause the model to overfit on the few training samples, causing it to under-perform on novel-category test samples. The main reason for overfitting is that the number of training samples per category is much smaller than the dimensionality of the feature space, and therefore the variance of the few samples cannot accurately capture the distribution of the class.

We address these shortcomings of high dimensionality and variable variance by proposing the use of relative features, a variance estimator, and a category-agnostic transformation. Still, the traditional training procedure involving mini-batches from a large dataset would not produce a satisfactory model, since it does not simulate the test condition well. Each test category contains only a few samples, so extracting mini-batches for training is impossible. Hence, an episodic training strategy inspired by [59] needs to be deployed.

In episodic learning, the set of few labeled samples available from each of the novel categories is known as the support set. The set of unobserved testing samples of the novel categories is often called the query set. If the support set were large, we could have just trained the model on the support set. However, since the support set is small, traditional training of a model would result in over-fitting and consequently the model would produce unsatisfactory performance on the testing data. The episodic training strategy can avert this poor performance by simulating the test conditions. In each training episode, we first select $C$ classes randomly from among the base categories. From each of those selected classes, we randomly select $K$ support and $Q$ query samples that are disjoint from each other. This sampling strategy is called the $C$-way $K$-shot sampling strategy. In general, $K$ is the same as the number of support samples present per novel category, while $Q$ is user-specified and is generally set in the range of 5 to 15 per category. Using this $C$-way $K$-shot sampling strategy, we form the training support set $\mathcal{S} = \{(\mathbf{x}_i, y_i)\}_{i=1}^{CK}$, where $y_i \in \{1, \ldots, C\}$, and also the training query set $\mathcal{Q} = \{(\tilde{\mathbf{x}}_j, \tilde{y}_j)\}_{j=1}^{CQ}$, where $\tilde{y}_j \in \{1, \ldots, C\}$. In the training episode, the support set is used to represent the class while the query set is used for the evaluation.
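The routine below is a minimal NumPy sketch of this $C$-way $K$-shot episode construction; it is our own illustration rather than the authors' code, and the function name and synthetic label array are hypothetical.

```python
import numpy as np

def sample_episode(labels, C, K, Q, rng=None):
    """Sample a C-way K-shot episode with Q query points per class.

    labels: 1-D integer array of class labels over the base dataset.
    Returns the sampled classes plus disjoint support/query index arrays
    of shapes (C, K) and (C, Q).
    """
    rng = rng or np.random.default_rng()
    classes = rng.choice(np.unique(labels), size=C, replace=False)
    support_idx, query_idx = [], []
    for c in classes:
        idx = rng.permutation(np.flatnonzero(labels == c))[:K + Q]
        support_idx.append(idx[:K])   # K support samples
        query_idx.append(idx[K:])     # Q disjoint query samples
    return classes, np.stack(support_idx), np.stack(query_idx)

# Example: a 5-way 5-shot episode with 15 query points per class,
# drawn from 64 synthetic base classes with 600 samples each.
labels = np.repeat(np.arange(64), 600)
classes, s_idx, q_idx = sample_episode(labels, C=5, K=5, Q=15)
```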

III-B Relative-Feature-Space Representation

The first step of our proposed few-shot learning framework is feature extraction from the raw samples. This is done by feeding the support samples from $\mathcal{S}$ and the query samples from $\mathcal{Q}$ through the feature-extraction module $f(\cdot)$ to produce the embeddings $f(\mathbf{x}_i)$ and $f(\tilde{\mathbf{x}}_j)$, respectively. The dimensionality of this absolute feature map is very large compared to the total number of support and query samples. This sparsity of samples relative to the dimension volume generally leads to over-fitting and poor generalization performance. To address this dimensionality problem, we propose the relative-feature-space representation, whose dimensionality is comparable to the total number of support and query samples in an episode. The dimensionality of this relative feature space is therefore much less than that of the original absolute feature space.

The relative feature of a sample in an episode is computed by calculating the squared pairwise Euclidean distances between the sample and all samples in the episode, including itself. Hence, if there are $T = C(K+Q)$ samples in an episode, counting all support and query samples regardless of their categories, then the relative feature of the $i^{th}$ sample is given as

$\mathbf{r}_i = \left[\, \|\mathbf{f}_i - \mathbf{f}_1\|^2,\; \|\mathbf{f}_i - \mathbf{f}_2\|^2,\; \ldots,\; \|\mathbf{f}_i - \mathbf{f}_T\|^2 \,\right]^\top$   (1)

where $\mathbf{f}_i = f(\mathbf{x}_i)$ is the absolute feature of the $i^{th}$ sample and $\|\cdot\|$ is the Euclidean norm. Note that $\|\mathbf{f}_i - \mathbf{f}_j\|^2 = 0$ for $i = j$. The dimensionality of this relative feature map is therefore $T$. Since this relative feature-space dimensionality is comparable to the number of samples, and these features contain important structural information about the data, we expect the inclusion of this feature to increase few-shot testing performance.

In Fig. 3, we show a simple example of how to compute the relative-feature representation from the absolute-feature representation. Consider three image samples $A$, $B$ and $C$ in an episode whose absolute-feature representations are $\mathbf{f}_A$, $\mathbf{f}_B$ and $\mathbf{f}_C$, respectively. They are pairwise separated by Euclidean distances of 1 ($A$ to $B$), 2 ($B$ to $C$) and 3 ($A$ to $C$), as shown in the figure. From Eq. (1), the relative-feature representation is obtained by squaring the pairwise Euclidean distances. Since there are three points in the episode, these points will lie in a three-dimensional relative-feature space, where they are represented as $[0, 1, 9]^\top$, $[1, 0, 4]^\top$ and $[9, 4, 0]^\top$.

Fig. 3: This figure shows an example on how the low-dimensional relative-feature representation is computed from the original high-dimensional representation space. The original high dimensional feature space contains three data points. Accordingly, we would obtain a three-dimensional feature space if we compute pairwise distances of a data-point with itself and other points.
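As a quick check of this example, the NumPy sketch below (ours, with the pairing of distances to sample pairs assumed as above) computes Eq. (1) for three collinear one-dimensional features at 0, 1 and 3, which realize pairwise distances of 1, 2 and 3:

```python
import numpy as np

def relative_features(F):
    """Map absolute features F of shape (T, D) to relative features (T, T):
    entry (i, j) is the squared Euclidean distance between samples i and j."""
    sq_norms = (F ** 2).sum(axis=1)
    # ||f_i - f_j||^2 = ||f_i||^2 + ||f_j||^2 - 2 f_i . f_j
    R = sq_norms[:, None] + sq_norms[None, :] - 2.0 * F @ F.T
    return np.maximum(R, 0.0)  # clip tiny negatives from round-off

F = np.array([[0.0], [1.0], [3.0]])  # samples A, B, C on a line
print(relative_features(F))
# [[0. 1. 9.]
#  [1. 0. 4.]
#  [9. 4. 0.]]
```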

III-C Variance Estimation

After embedding the support and query points in the absolute-feature space and the relative-feature space, our goal is to use these features for classification. We do not want to tie our model to any category. We want to make our model generalizable to novel categories, and therefore we do not use the classification layer that is common in traditional neural networks. Instead, a nearest-class-mean approach is used [36], where the query-point embeddings are compared to the prototype representation of each class. The prototypes $\mathbf{c}_k$ and $\mathbf{c}_k^r$ of a class $k$ can be found by averaging the embedded support points of the class for the absolute and relative representations, respectively, as follows:

$\mathbf{c}_k = \frac{1}{|\mathcal{S}_k|} \sum_{(\mathbf{x}_i, y_i) \in \mathcal{S}_k} f(\mathbf{x}_i)$   (2)

$\mathbf{c}_k^r = \frac{1}{|\mathcal{S}_k|} \sum_{(\mathbf{x}_i, y_i) \in \mathcal{S}_k} \mathbf{r}_i$   (3)

where $\mathcal{S}_k$ is the subset of the support set $\mathcal{S}$ that belongs to class $k$. Using these prototypes, we can proceed to calculate the probability distributions over classes $p(y = k \mid \tilde{\mathbf{x}})$ and $p_r(y = k \mid \tilde{\mathbf{x}})$ for a query point $\tilde{\mathbf{x}}$. This is done using the softmax operation with distance metrics $d_a$ and $d_r$ for the absolute and relative representations, respectively, as follows:

$p(y = k \mid \tilde{\mathbf{x}}) = \frac{\exp(-d_a(f(\tilde{\mathbf{x}}), \mathbf{c}_k))}{\sum_{k'} \exp(-d_a(f(\tilde{\mathbf{x}}), \mathbf{c}_{k'}))}$   (4)

$p_r(y = k \mid \tilde{\mathbf{x}}) = \frac{\exp(-d_r(\tilde{\mathbf{r}}, \mathbf{c}_k^r))}{\sum_{k'} \exp(-d_r(\tilde{\mathbf{r}}, \mathbf{c}_{k'}^r))}$   (5)

where the summations are over all the classes present in the episode and $\tilde{\mathbf{r}}$ is the relative feature of the query point. In Eqs. (4) and (5), the distance metrics $d_a$ and $d_r$ need to be defined in order to compute the probability distributions. Snell et al. [55] compared cosine and Euclidean distances and found the Euclidean distance to perform better for few-shot testing. They argued that the Euclidean-distance metric is an example of a Bregman divergence; as a result, prototype computation and inference can be thought of as performing mixture density estimation with exponential-family distributions. However, if the Euclidean distance is used, we assume that all the classes have the same spread in the embedding space. This assumption may lead to poor classification performance because all the classes may not have the same variance. Thus, we propose to use the Mahalanobis distance to measure and include the spread of each class in the classification scheme.

The Mahalanobis metric measures the distance between a data point $\mathbf{x}$ and a distribution. If the distribution has an associated mean $\boldsymbol{\mu}$ and an invertible covariance matrix $\Sigma$, then the Mahalanobis distance is calculated as

$D_M(\mathbf{x}) = \sqrt{(\mathbf{x} - \boldsymbol{\mu})^\top \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu})}$   (6)

where $\Sigma^{-1}$ is the inverse of the covariance matrix $\Sigma$. In case the distribution is spherically Gaussian with a variance $\sigma^2$ for all the feature dimensions, i.e., $\Sigma = \sigma^2 \mathbf{I}$, the Mahalanobis distance reduces to

$D_M(\mathbf{x}) = \frac{\|\mathbf{x} - \boldsymbol{\mu}\|}{\sigma}$   (7)

where $\mathbf{I}$ is an appropriate identity matrix.
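A small sketch of both forms of the distance (our own illustration, not tied to the paper's code) makes the relationship concrete:

```python
import numpy as np

def mahalanobis(x, mu, cov):
    """General Mahalanobis distance of Eq. (6) for an invertible covariance."""
    d = x - mu
    return np.sqrt(d @ np.linalg.inv(cov) @ d)

def spherical_mahalanobis(x, mu, sigma):
    """Spherical-Gaussian special case of Eq. (7): Euclidean distance / sigma."""
    return np.linalg.norm(x - mu) / sigma

x, mu = np.array([2.0, 0.0]), np.zeros(2)
print(mahalanobis(x, mu, 4.0 * np.eye(2)))  # 1.0
print(spherical_mahalanobis(x, mu, 2.0))    # 1.0, the same point with sigma = 2
print(np.linalg.norm(x - mu))               # 2.0, the plain Euclidean distance
```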

The importance of using the Mahalanobis distance over the Euclidean distance is illustrated in Fig. 4, in which we have three classes with prototypes centered at $\mathbf{c}_1$, $\mathbf{c}_2$ and $\mathbf{c}_3$. The spread of the classes is quantified through the standard deviations $\sigma_1$, $\sigma_2$ and $\sigma_3$. The goal is to classify the query points $\mathbf{q}_1$, $\mathbf{q}_2$ and $\mathbf{q}_3$ into one of the three classes. If we use the Euclidean distances for comparison, point $\mathbf{q}_1$ would yield equal probabilities for classes 1 and 2, since the point is equidistant from those classes. This classification does not take into consideration that the spread of class 1 is larger than the spread of class 2, i.e., $\sigma_1 > \sigma_2$. If we use the Mahalanobis distance, $\|\mathbf{q}_1 - \mathbf{c}_1\| / \sigma_1 < \|\mathbf{q}_1 - \mathbf{c}_2\| / \sigma_2$, and accordingly the query point will yield a higher probability for class 1. A similar treatment can also be applied to query points $\mathbf{q}_2$ and $\mathbf{q}_3$.

Fig. 4: This figure shows an example where different classes can have different variances. As a result, the Mahalanobis distance may be preferred over the Euclidean distance for classifying a test query point into one of these classes.

In our model, we expect each class $k$ to have its own covariance matrix $\Sigma_k$. Therefore, there is a need to model the covariance as a function of each class's prototype. However, the covariance matrix is very high-dimensional, requiring a lot of parameters to model it. Furthermore, the covariance matrix is required to be positive definite, a constraint that needs to be satisfied strictly. Hence, we settle for a spherical Gaussian distribution with the same variance for all the feature dimensions. Since we let the class variance be a function of the class's prototype, we can write

$\sigma_k^2 = S(\mathbf{c}_k)$   (8)

where $\sigma_k^2$ and $\mathbf{c}_k$ are the variance and prototype of class $k$, respectively, and $S(\cdot)$ is the variance-estimating function. This concept of predictable variance may be difficult to grasp initially. However, one can think of it as curve fitting of a function whose input is the prototype and whose output is the variance of the corresponding prototype. The function is fit using the large amount of data available from the base categories. Since we expect the function to be smooth, prototypes close to each other should produce similar variances. After training is over, this function can then be used to predict the variance of novel-class prototypes. The variance-estimating function can therefore be implemented by a neural network. Hence, using Eqs. (7) and (8), the distance metric in Eq. (4) can be expressed as the square of the Mahalanobis distance as follows:

$d_a(f(\tilde{\mathbf{x}}), \mathbf{c}_k) = \frac{\|f(\tilde{\mathbf{x}}) - \mathbf{c}_k\|^2}{\sigma_k^2}$   (9)

For the relative-feature space, the concept of a variance does not have any physical meaning. As a result, we simply use the square of the Euclidean distance for $d_r$, such that

$d_r(\tilde{\mathbf{r}}, \mathbf{c}_k^r) = \|\tilde{\mathbf{r}} - \mathbf{c}_k^r\|^2$   (10)

The representation is learned by minimizing the negative log-probability averaged over all the query points. The negative log-probability of a query point $(\tilde{\mathbf{x}}, \tilde{y})$ is given as

$J(\theta, \phi) = -\log p(y = \tilde{y} \mid \tilde{\mathbf{x}}) - \lambda \log p_r(y = \tilde{y} \mid \tilde{\mathbf{x}})$   (11)

where $\theta$ and $\phi$ are composed of all the trainable parameters of the feature extractor ($f$) and the variance estimator ($S$), respectively, and $\lambda$ is a hyper-parameter for the regularization in Eq. (11). The negative log-probability averaged over all the query points in the batch is then minimized.
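Putting Eqs. (1)-(11) together, a forward pass of the first-stage loss can be sketched as follows (a NumPy illustration of our own; the function names are hypothetical, and the per-class variances are assumed to be already predicted by the variance estimator):

```python
import numpy as np

def log_softmax(scores):
    z = scores - scores.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

def first_stage_loss(f_support, f_query, y_support, y_query, sigma2, lam):
    """Negative log-probability of Eq. (11), averaged over the query points.

    f_support: (C*K, D) absolute support embeddings; f_query: (C*Q, D).
    sigma2: (C,) per-class variances from the variance estimator S (Eq. 8).
    lam: weight lambda on the relative-feature term.
    """
    C = sigma2.shape[0]
    protos = np.stack([f_support[y_support == k].mean(0) for k in range(C)])

    # Squared Mahalanobis distances of Eq. (9), shape (num_query, C).
    d_abs = ((f_query[:, None, :] - protos[None]) ** 2).sum(-1) / sigma2[None]

    # Relative features over all episode samples (Eq. 1), then Eq. (10).
    F = np.concatenate([f_support, f_query])
    R = ((F[:, None, :] - F[None]) ** 2).sum(-1)
    r_s, r_q = R[:len(f_support)], R[len(f_support):]
    r_protos = np.stack([r_s[y_support == k].mean(0) for k in range(C)])
    d_rel = ((r_q[:, None, :] - r_protos[None]) ** 2).sum(-1)

    rows = np.arange(len(y_query))
    return -(log_softmax(-d_abs)[rows, y_query]
             + lam * log_softmax(-d_rel)[rows, y_query]).mean()
```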

III-D Category-agnostic Transformation

After the feature-extraction model and the variance estimator are trained, we proceed to the next stage of training. In this stage, we propose to find a category-agnostic transformation from the mean-sample representation of a class to the prototype representation of the corresponding class. Learning this transformation is important because the novel categories have very few support samples, and so the mean-sample representation will not accurately represent the prototype. The existence of this category-agnostic transformation may be questionable. However, previous work by Wang et al. [62] suggested the existence of a similar transformation between model parameters trained using a small number of samples and model parameters trained using a large number of samples. Since model parameters and samples are duals of each other, we conjecture the existence of a transformation between the mean-sample representation and the prototypes. We next determine this category-agnostic transformation and the factors on which it depends.

Fig. 5: Example depicting the choice of factors affecting the category-agnostic transformation from a support data-point to the corresponding prototype.

In addition to the mean-sample representation, the location of the novel-class prototype also depends on the nearby base-class prototypes. This is illustrated through an example in Fig. 5, in which we have one support sample point for a novel class. This support data-point may not always be able to represent the class prototype because it might lie on the edge of the distribution, as in this example. The transformation function mapping the support point to the unknown class prototype should depend on the support point as well as on the nearby similar base categories, because the neighboring class prototypes condition the possible locations of the novel-class prototype. In this example, base classes 1 and 3 form the neighboring categories on which the location of the novel-class prototype should depend. Base class 2 is far from the novel class in the feature space, so it should have little effect on the location of the novel-class prototype. We next describe the construction of the transformation function $g$.

The prototype of a novel category depends on the mean-sample representation and the base-category prototypes collected in $P \in \mathbb{R}^{B \times D}$, where $P$ consists of the $B$ base-category prototypes stacked vertically in a matrix and $D$ is the dimension of the absolute-feature space in which the prototypes lie. Ideally, the prototype matrix $P$ should be calculated using the base categories. Since each base category has a large number of samples, the mean representation can be used as an accurate estimate of each prototype. Thus, the prototype of a novel class $n$ can be represented as

$\mathbf{c}_n = g(\bar{\mathbf{f}}_n, P)$   (12)

where $\bar{\mathbf{f}}_n$ is the mean-sample representation of the novel class $n$. We can decompose the function $g$ into two functions, $g(\bar{\mathbf{f}}_n, P) = g_1(\bar{\mathbf{f}}_n) + g_2(\bar{\mathbf{f}}_n, P)$, where $g_1$ is the contribution due to the mean-sample representation and $g_2$ is the contribution due to the base-class prototypes $P$. Since the contribution of the base-class prototypes depends on the closeness of $\bar{\mathbf{f}}_n$ to the prototypes in $P$, $g_2$ also depends on $\bar{\mathbf{f}}_n$. We next discuss the construction of the functions $g_1$ and $g_2$.

Contribution of novel-class samples using a residual connection. The function $g_1$ is a complex non-linear function that transforms the mean-sample representation $\bar{\mathbf{f}}_n$ towards the prototype $\mathbf{c}_n$. In case the number of samples in the novel category is large, $g_1$ should map $\bar{\mathbf{f}}_n$ identically to itself. Hence, it is important for the function $g_1$ to be able to model identity mappings. Residual connections and networks have been shown to model identity functions smoothly [20]. In our case, the corresponding meaningful residual connection is $\bar{\mathbf{f}}_n + \mathbf{b}$, where $\mathbf{b}$ is a bias term that does not have a scaling effect on the mean-sample representation. Thus, if we include a scaled residual connection, then

$g_1(\bar{\mathbf{f}}_n) = A\,\bar{\mathbf{f}}_n + \mathbf{b}$   (13)

where $A$ is a scaling matrix. Letting $\mathbf{b} = h(\bar{\mathbf{f}}_n)$, the bias term becomes a complex non-linear term that can be modeled using a multi-layer neural network $h$.

Contribution of the base classes. The function $g_2$ models the contribution of the base-class prototypes to the novel-class prototype. Base classes that are similar to the novel class will contribute more. This similarity can be measured in terms of the Euclidean distance between a novel-class mean-sample representation and a base-class prototype. The contribution of a base class $b$ to a novel class $n$ is quantified through a probability distribution,

$p_{nb} = \frac{\exp(-\|\bar{\mathbf{f}}_n - \mathbf{c}_b\|^2)}{\sum_{b'=1}^{B} \exp(-\|\bar{\mathbf{f}}_n - \mathbf{c}_{b'}\|^2)}$   (14)

where $\mathbf{c}_b$ is the prototype belonging to the $b^{th}$ base class. The probability is computed for all the $B$ base classes, and these probabilities are stacked together to form a probability vector $\mathbf{p}_n$ for the novel class $n$. After that, we apply a threshold $\epsilon$ to the probability vector $\mathbf{p}_n$: only the elements above the threshold are kept while the other elements are set to zero. This thresholding step is important as it removes the effect of base classes that have very little contribution to the novel class. From the feature-space perspective, base classes that are distant from the novel class are ignored. This step is our attempt to prevent negative transfer [44], where irrelevant base classes contributing to learning novel-class recognition would reduce the recognition performance. The thresholded probability vector is denoted as $\tilde{\mathbf{p}}_n$. It is used to combine the base-class prototypes such that

$g_2(\bar{\mathbf{f}}_n, P) = M\, P^\top \tilde{\mathbf{p}}_n$   (15)

where $M$ is a scaling matrix. The factor $P^\top \tilde{\mathbf{p}}_n$ linearly combines the contributing base-class prototypes, and the presence of $M$ is important in scaling the effect of this term within the whole transformation function $g$. Next, we discuss the procedure to learn this category-agnostic transformation $g$ using the large labeled dataset available from the base categories.
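Before turning to the training procedure, the construction of $g$ in Eqs. (12)-(15) can be summarized in a short sketch (our own illustration; the residual network $h$ is passed in as a callable, and the toy values are hypothetical):

```python
import numpy as np

def transform_prototype(f_mean, P, A, M, h, eps):
    """Predicted novel-class prototype g = g1 + g2 (Eqs. 12-15).

    f_mean: (D,) mean-sample representation of the novel class.
    P: (B, D) base-class prototype matrix; A, M: (D, D) scaling matrices.
    h: residual network producing the non-linear bias term b.
    eps: threshold applied to the base-class probability vector.
    """
    # g1: scaled residual connection around the mean sample (Eq. 13).
    g1 = A @ f_mean + h(f_mean)

    # Similarity of the novel class to each base prototype (Eq. 14).
    neg_d = -((P - f_mean[None]) ** 2).sum(axis=1)
    p = np.exp(neg_d - neg_d.max())
    p /= p.sum()

    # Keep only sufficiently similar base classes to limit negative transfer.
    p_thr = np.where(p > eps, p, 0.0)

    # g2: scaled linear combination of contributing base prototypes (Eq. 15).
    return g1 + M @ (P.T @ p_thr)

# Toy usage: two base prototypes and an identity-like residual net.
D = 4
P = np.vstack([np.zeros(D), np.ones(D)])
c_hat = transform_prototype(np.full(D, 0.1), P, np.eye(D), 0.1 * np.eye(D),
                            h=lambda v: np.zeros_like(v), eps=0.05)
```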

Training Strategy. In the second stage of training, we follow an episodic training strategy similar to the first stage. In each training episode, we randomly sample $C$ categories from among the base categories. We call these $C$ categories pseudo-novel categories and refer to the remaining base categories as pseudo-base categories. The goal of this training strategy is to simulate the testing scenario, where we have novel classes as well as already known base classes.

In a training episode, the prototypes of the pseudo-base categories are calculated using the mean-sample representation. These prototypes are stacked together to form the prototype matrix $P$. For each pseudo-novel category, we randomly select $K$ support and $Q$ query samples that are disjoint from each other. From these, we form the training support set $\mathcal{S}$ and the training query set $\mathcal{Q}$, as in the first stage. For a category $k$ belonging to one of the $C$ pseudo-novel categories, we calculate the corresponding class prototype $\hat{\mathbf{c}}_k$ using Eqs. (12)-(15). Using this modified prototype $\hat{\mathbf{c}}_k$, we can proceed to calculate the probability distribution over classes for a query point $\tilde{\mathbf{x}}$. This is done using the softmax operation with the Mahalanobis distance metric as described previously:

$p(y = k \mid \tilde{\mathbf{x}}) = \frac{\exp(-d_a(f(\tilde{\mathbf{x}}), \hat{\mathbf{c}}_k))}{\sum_{k'} \exp(-d_a(f(\tilde{\mathbf{x}}), \hat{\mathbf{c}}_{k'}))}$   (16)

where the summation is over all the pseudo-novel classes present in the episode.

The training is carried out by minimizing the negative log-probability averaged over all the query points. The negative log-probability of a query point is given as $J(\psi) = -\log p(y = \tilde{y} \mid \tilde{\mathbf{x}})$, where $\psi$ consists of the scaling matrices $A$ and $M$ and all the trainable parameters of the residual network $h$. We also include a regression-based regularization involving the ground-truth and predicted prototypes of the pseudo-novel classes. If the ground-truth prototype of class $k$ is $\mathbf{c}_k$ and the predicted prototype is $\hat{\mathbf{c}}_k$, then the corresponding regularization term is $\|\mathbf{c}_k - \hat{\mathbf{c}}_k\|^2$. This regularization is averaged over all the prototypes of the pseudo-novel classes and weighted by a regularization coefficient $\eta$.
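A sketch of this second-stage objective (ours; it mirrors the first-stage loss with the transformed prototypes and adds the regression regularizer):

```python
import numpy as np

def second_stage_loss(f_query, y_query, pred_protos, true_protos, sigma2, eta):
    """Mahalanobis softmax loss of Eq. (16) plus eta * ||c_k - c_k_hat||^2.

    pred_protos: (C, D) prototypes predicted by the transformation g.
    true_protos: (C, D) ground-truth prototypes of the pseudo-novel classes.
    """
    d = ((f_query[:, None, :] - pred_protos[None]) ** 2).sum(-1) / sigma2[None]
    z = -d - (-d).max(axis=1, keepdims=True)
    log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    nll = -log_p[np.arange(len(y_query)), y_query].mean()
    reg = ((true_protos - pred_protos) ** 2).sum(axis=1).mean()
    return nll + eta * reg
```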

Given: base-category training data $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}$, where each $y_i$ is a base-category label; $\mathcal{D}_k$ is the subset of $\mathcal{D}$ containing the elements of class $k$.
First training stage:
Randomly initialize the parameters $\theta$, $\phi$ of feature extraction ($f$) and variance estimation ($S$).
for each episode do
  Sample $C$ classes; for each sampled class $k$, sample $K$ support and $Q$ disjoint query points from $\mathcal{D}_k$.
  Compute the absolute features of all episode samples with $f$, and the relative features via Eq. (1).
  for each sampled class $k$ do
    Compute the prototypes $\mathbf{c}_k$ and $\mathbf{c}_k^r$ via Eqs. (2) and (3), and the variance $\sigma_k^2 = S(\mathbf{c}_k)$ via Eq. (8).
  end for
  for each query point do
    Compute the distances via Eqs. (9) and (10), the class probabilities via Eqs. (4) and (5), and the loss $J(\theta, \phi)$ via Eq. (11).
  end for
  Take a gradient step of the loss averaged over all query points with respect to $(\theta, \phi)$.
end for
First training stage ends and second training stage starts.
Randomly initialize the parameters $\psi$ of the category-agnostic transformer ($g$).
for each episode do
  Sample $C$ pseudo-novel classes; form the pseudo-base prototype matrix $P$ from the remaining classes.
  for each pseudo-novel class $k$ do
    Sample $K$ support and $Q$ disjoint query points; compute the mean-sample representation $\bar{\mathbf{f}}_k$; predict the prototype $\hat{\mathbf{c}}_k$ via Eqs. (12)-(15).
  end for
  for each query point do
    Compute the class probabilities via Eq. (16), and the loss $J(\psi)$ together with the regularization $\eta \|\mathbf{c}_k - \hat{\mathbf{c}}_k\|^2$.
  end for
  Take a gradient step of the loss averaged over all query points with respect to $\psi$.
end for
Algorithm 1 Proposed two-stage few-shot learning procedure.

After the training is done, testing is also carried out in an episodic fashion. For each episode, we randomly sample $C$ classes from the novel test classes. From each novel class, $K$ support samples and a number of query samples are drawn randomly. The class prediction for a query point $\tilde{\mathbf{x}}$ is given as the class $k$ that minimizes $d_a(f(\tilde{\mathbf{x}}), \hat{\mathbf{c}}_k)$. The overall training procedure of the proposed two-stage few-shot learning method is provided in Algorithm 1.

Fig. 6: Instances of the dataset used in our experiment for (a) Omniglot, (b) miniImagenet, (c) CUB-200, and (d) CIFAR-100.

IV Experimental Results

IV-A Datasets

To evaluate our proposed few-shot learning approach, we performed experiments on four datasets – Omniglot [29], miniImagenet, CUB-200 [63] and CIFAR-100 [27]. These datasets provide a large variety of category-level granularity, image resolution and categories to test upon. The Omniglot dataset consists of 1623 handwritten characters taken from 50 alphabets. Each character has 20 examples associated with it, each written by a different person, resulting in sufficient intra-class variation. Following the procedure of Vinyals et al. [59], the images are resized to $28 \times 28$, and each character class is augmented with more samples by applying rotations in multiples of 90 degrees. Around 1200 character classes (a total of 4800 including rotations) are chosen as the training (i.e., base) categories and the remaining classes are chosen as the testing (i.e., novel) categories. The miniImagenet dataset is a subset of the ILSVRC-12 dataset [49]. It consists of RGB color images of size $84 \times 84$, spanning 100 classes with 600 examples in each class. The 100 classes are divided into 64 for training (base), 16 for validation and 20 for testing (novel).

Fig. 7: Network architectures used for the modules $f$, $S$ and $g$. (a) For the Omniglot dataset, $f$ produces a 64-dimensional feature map from a $28 \times 28$ input image, the module $S$ produces a scalar variance from the feature map, and $g$ regresses a 64-dimensional output from the feature map. (b) For the miniImagenet dataset, $f$ produces a 1600-dimensional feature map from an $84 \times 84$ input image, the module $S$ produces a scalar variance from the feature map, and $g$ regresses a 1600-dimensional output from the feature map.

The CUB-200 and CIFAR-100 datasets were introduced long before but have only recently been used as benchmarks for few-shot learning algorithms. The CUB-200 dataset is a fine-grained dataset consisting of 11,788 images, resized to $84 \times 84$, distributed across 200 categories of bird species. Using the class splits in [21], we have 100, 50 and 50 categories for training, validation and testing, respectively. The CIFAR-100 dataset consists of 60000 low-resolution images of size $32 \times 32$. These images are distributed across 100 fine-grained categories or 20 coarse-grained categories. Using the class splits in [3], we have 64, 16 and 20 categories for training, validation and testing, respectively. Figures 6(a), (b), (c) and (d) show some examples from the Omniglot, miniImagenet, CUB-200 and CIFAR-100 datasets, respectively.

IV-B Implementation

In this sub-section, we discuss the details of our neural network architecture and the training procedure. For the feature-extractor module ($f$) of our trainable neural network architecture, we use four convolutional blocks. This feature-extractor architecture is the same as that used in previous works [55, 59], for the sake of fair comparison. Most of these previous works selected the feature-extraction architecture empirically. For a shallower convolutional architecture, and therefore a higher-dimensional feature space, the performance is poor because the extracted features are not robust and not class-discriminative enough. As the depth of the convolutional architecture increases up to a certain limit, we obtain a more informative low-dimensional feature space and therefore better recognition performance. The authors of [55, 59] experimented and found the presented four-convolutional-block architecture to be lightweight and optimal. Each of these blocks consists of a 64-filter $3 \times 3$ convolution layer with SAME padding, a batch-normalization layer, and a ReLU activation followed by a $2 \times 2$ max-pooling layer, all stacked one upon another. Batch normalization [23] results in better recognition performance because it prevents internal covariate shift. When an Omniglot image is applied as an input to these four convolutional blocks, the output is a 64-dimensional feature vector.

The variance estimator $S$ consists of two convolutional blocks. Each convolutional block consists of a convolution layer with SAME padding, a batch-normalization layer and a ReLU activation layer. The first and second convolutional blocks consist of 32 and 1 filters, respectively. The last layer, which produces the variance, uses the softplus operation as its activation function, selected so as to produce only positive outputs.

The transformation layer $g$ consists of three fully connected layers of 128, 96 and 64 dimensions. Except for the last layer, all the layers contain batch normalization and ReLU activation functions. The last layer does not have a ReLU activation so that it can output both negative and positive transformation shifts. The overall architecture of all the modules used for the Omniglot dataset is shown in Fig. 7(a).
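A PyTorch sketch of the three Omniglot modules as described above (our own reading of Fig. 7(a); the $3 \times 3$ kernels, the single-channel $28 \times 28$ inputs, and the exact placement of the softplus are assumptions):

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # convolution + batch norm + ReLU + 2x2 max-pooling, as in [55, 59]
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1),
                         nn.BatchNorm2d(out_ch), nn.ReLU(),
                         nn.MaxPool2d(2))

# Feature extractor f: four 64-filter convolutional blocks.
f = nn.Sequential(conv_block(1, 64), conv_block(64, 64),
                  conv_block(64, 64), conv_block(64, 64), nn.Flatten())

# Variance estimator S: two convolutional blocks (32 filters, then 1),
# ending in a softplus so the predicted variance is always positive.
S = nn.Sequential(nn.Conv2d(64, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
                  nn.Conv2d(32, 1, 3, padding=1), nn.BatchNorm2d(1),
                  nn.Flatten(), nn.Softplus())

# Bias network h of the transformer g: fully connected 128 -> 96 -> 64,
# with no ReLU on the last layer so shifts can be negative or positive.
h = nn.Sequential(nn.Linear(64, 128), nn.BatchNorm1d(128), nn.ReLU(),
                  nn.Linear(128, 96), nn.BatchNorm1d(96), nn.ReLU(),
                  nn.Linear(96, 64))

x = torch.randn(2, 1, 28, 28)  # two Omniglot-sized inputs
print(f(x).shape)              # torch.Size([2, 64])
```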

The neural-network structure was trained using Adam [25], a variant of stochastic gradient descent. The first-stage training was carried out using 60-way 5-shot episodes with 5 query points per episode. The higher way is chosen in training so that the model can learn the more difficult task of distinguishing more classes and therefore produce a more discriminative feature space. In this paper, the second-stage training episodic setup is always kept the same as the testing episodic setup for all the experiments; that is, if the testing setup is $C$-way $K$-shot, so is the second-stage training setup.

The hyper-parameters $\lambda$, $\epsilon$ and $\eta$ were set to fixed values that are kept constant for a particular dataset. This is mainly because cross-validation is not always feasible in the few-shot learning setting, which contains only a few samples from the target category. Also, the validation classes are not representative of the test classes.

For reporting the recognition performance, 1000 random test episodes were selected and the accuracy was obtained by averaging over all the test episodes. Each episode contained the corresponding $C$-way $K$-shot support samples and 5 query samples per way for testing.
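The reporting protocol amounts to averaging episode accuracies and, where given, attaching a normal-approximation 95% confidence interval; a small sketch of ours, with placeholder accuracies:

```python
import numpy as np

def summarize_accuracy(acc_per_episode):
    """Mean episode accuracy with a 95% confidence interval."""
    acc = np.asarray(acc_per_episode)
    ci95 = 1.96 * acc.std(ddof=1) / np.sqrt(len(acc))
    return acc.mean(), ci95

# Placeholder accuracies for 1000 test episodes.
accs = np.random.default_rng(0).uniform(0.95, 1.0, size=1000)
mean, ci = summarize_accuracy(accs)
print(f"{100 * mean:.2f} ± {100 * ci:.2f} %")
```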

Method 5-way 1-shot 5-way 5-shot 20-way 1-shot 20-way 5-shot
SIAMESE [26] 97.3 98.4 88.1 97.0
MANN [51] 82.8 94.9 - -
MATCHING NET [59] 98.1 98.9 93.8 98.5
SIAMESE MEMORY [24] 98.4 99.6 95.0 98.6
NEURAL STATISTICIAN [12] 98.1 99.5 93.2 98.1
MAML [14] 98.7±0.4 99.9±0.1 95.8±0.3 98.9±0.2
META NET [39] 99.0 - 97.0 -
PROTO NET [55] 98.8 99.7 96.0 98.9
RELATION NET [57] 99.6±0.2 99.8±0.1 97.6±0.2 99.1±0.1
OUR PROPOSED METHOD 99.2±0.3 99.5±0.2 97.2±0.3 98.9±0.3
TABLE I: Results of few-shot classification on the Omniglot dataset. Accuracies in % are reported as averages over 1000 test episodes. Some of the studies report 95% confidence intervals; results not reported are shown by '-'.

For the miniImagenet dataset, we used the same feature-extraction network architecture as for the Omniglot dataset. However, since the miniImagenet dataset has images of size $84 \times 84$, the convolution module produces a 1600-dimensional feature vector. The variance estimator is also the same as that of the Omniglot dataset, except that it contains a max-pooling stage before the non-linearity; this is required to reduce the 1600-dimensional feature map to a scalar variance value. The transformation layer consists of three fully connected layers of 3200, 2400 and 1600 dimensions. The overall architecture of all the modules used for the miniImagenet dataset is shown in Fig. 7(b).

The hyper-parameters $\lambda$, $\epsilon$ and $\eta$ were again kept fixed for this dataset. For testing with the 5-way 1-shot and 5-way 5-shot episodic strategies, we used a 20-way 1-shot and 20-way 5-shot sampling strategy, respectively, in the first-stage training. Each episode contained the corresponding $C$-way $K$-shot support samples and 15 query samples per way for testing. Results were reported by computing the average accuracy over 600 such randomly sampled episodes with a 95% confidence interval.

For the CUB-200 and CIFAR-100 datasets, we used the same four-convolutional-block architecture for the feature extractor as was used on the miniImagenet and Omniglot datasets. This embedding results in 1600- and 256-dimensional feature spaces for the CUB-200 and CIFAR-100 datasets, respectively. The transformation layer for the CUB-200 dataset consists of three fully connected layers of 3200, 2400 and 1600 dimensions, and the architecture of $S$ for the CUB-200 dataset is the same as that for the miniImagenet dataset. The transformation layer for the CIFAR-100 dataset consists of three fully connected layers of 512, 384 and 256 dimensions, and the architecture of $S$ for the CIFAR-100 dataset is similar to that of the miniImagenet dataset except that the max-pooling step is applied only on the second convolutional block. The hyper-parameters $\lambda$, $\epsilon$ and $\eta$ were kept fixed on both the CUB-200 and CIFAR-100 datasets. It is important to note that, for a fair comparison, we only report previous work that used the simple four-convolutional-block embedding instead of the more sophisticated ResNet [20] architecture.

IV-C Comparison against Related Approaches

Since our proposed few-shot learning method has both meta-learning and metric-learning components, we compared our proposed method against recent meta-learning [14, 39, 48] and metric-learning [26, 59, 55, 57] methods. We also compared against recent memory-based models [51, 24] and the Neural Statistician method [12] that learns how to represent statistics of the data. The results of the comparisons on the Omniglot dataset are shown in Table I.

As seen from Table I, most of the recent methods achieved almost perfect recognition performance on the Omniglot dataset (8 out of 10 methods obtained an average accuracy of more than 98% for the 5-way 1-shot task). Our proposed method obtained average accuracies of 99.2% and 97.2% for the 5-way 1-shot and 20-way 1-shot tasks, respectively, which are better than those of most previous approaches. However, the Relation Network [57] produced the best results, 99.6% and 97.6% for the 5-way 1-shot and 20-way 1-shot tasks, respectively, because it learned a distance metric while our proposed method used a predefined Mahalanobis distance metric. The confidence interval of our proposed method (98.9%-99.5%) also overlapped with that of the Relation Network approach (99.4%-99.8%) for the 5-way 1-shot task, and the confidence intervals overlapped for the 20-way 1-shot task as well. As expected, higher shots during testing produced better results for our proposed method (98.9% vs. 97.2% for the 20-way task) because they represented the class statistics better than just one shot. Also, higher ways produced worse results (97.2% vs. 99.2% for the 1-shot task) because there were more potential classes to choose from and the chances of misclassification were higher.

For the miniImagenet dataset, the comparison is more challenging and there is more room for improvement towards perfect performance. The results of the comparison are shown in Table II. From Table II, we can see that our proposed method produced average accuracies of 52.68% and 70.91% on the 5-way 1-shot and 5-way 5-shot tasks, respectively, which are better than most of the previous methods. This can be mainly attributed to our two-stage training procedure, where the model learns both to represent and to classify in a low-shot regime. However, two methods – Predicting Parameters from Activations [46] (PPA) and the Transductive Propagation Network [33] (TPN) – produced better results than our proposed method in the 1-shot setting. Upon inspection, we realized that the PPA method used a pre-trained embedding, while most other few-shot learning methods, including ours, train the embedding/feature extractor from scratch. Using a pre-trained embedding implies that datasets beyond the base and novel categories have been used in training the model, and therefore the model is not strictly suitable for comparison; we nevertheless include the results for PPA in Table II for the sake of completeness. Also, the TPN method uses a transductive approach, which assumes that all the test/query data are available as a batch. Its improvement in performance is mainly due to the fact that the authors used the manifold of the unlabeled test data as well as the support data for inference. However, the method might not work if the number of query points is small or if the query points arrive in a streaming fashion, as in a real-world situation.

The results of our proposed method in comparison with previous work for the CUB-200 and CIFAR-100 datasets are shown in Table III and Table IV, respectively. In Table III, on the CUB-200 dataset, our proposed method produced about a 6-point improvement over the second-best method. Similarly, in Table IV, on the CIFAR-100 dataset, our proposed method produced around a 2-point improvement over the second-best method. This suggests that our proposed method can provide competitive performance on fine-grained and low-resolution datasets as well. Also, the average performance on the CUB-200 dataset is lower than that on the CIFAR-100 dataset. This is because the CUB-200 dataset contains more fine-grained categories than the CIFAR-100 dataset, and therefore its classes overlap more.

From these comparative studies, it is not clear how all the modules in our trainable neural-network architecture contributed to the performance. Therefore, we resort to further analyzing each component of our proposed method in the following sub-sections.

Method 5-way 1-shot 5-way 5-shot
META-LSTM [48] 43.44±0.77 60.60±0.71
MAML [14] 48.70±1.84 63.11±0.92
MATCHING NET [59] 43.56±0.84 55.31±0.73
META NET [39] 49.21±0.96 –
PROTO NET [55] 49.42±0.78 68.20±0.66
RELATION NET [57] 51.38±0.82 67.07±0.69
GNN [16] 50.33±0.36 66.41±0.63
REPTILE [41] 49.97 65.99
TPN [33] 53.75 69.43
PPA [46] 54.53±0.40 67.07±0.20
R2D2 [2] 51.8±0.2 68.4±0.2
OUR PROPOSED METHOD 52.68±0.51 70.91±0.85
TABLE II: Results of few-shot classification on the miniImagenet dataset. Accuracies are reported as averages over 600 test episodes. Most of these studies report 95% confidence intervals; unreported results are shown as '–'
Method 5-way 1-shot 5-way 5-shot
META-LSTM [48] 40.43 49.65
MAML [14] 38.43 59.15
MATCHING NET [59] 49.34 59.31
PROTO NET [55] 45.27 56.35
OUR PROPOSED METHOD 55.85 66.73
TABLE III: Results of few-shot classification on the CUB-200 dataset where our accuracy is reported as averaged over 600 test episodes
Method 5-way 1-shot 5-way 5-shot
MAML [14] 58.9±1.9 71.5±1.0
PROTO NET [55] 55.5±0.7 72.0±0.6
RELATION NET [57] 55.0±1.0 69.3±0.8
GNN [16] 61.9 75.3
R2D2 [2] 65.4±0.2 79.4±0.2
OUR PROPOSED METHOD 67.15±0.3 81.65±0.3
TABLE IV: Results of few-shot classification on the CIFAR-100 dataset where the accuracy is reported as averaged over 10000 test episodes. Most of these studies report 95% confidence intervals
Training way 5 10 15 20 25 30 (5-way 1-shot Testing) | 5 10 15 20 25 30 (5-way 5-shot Testing)
PN 43.987 46.956 46.589 46.122 47.253 47.3 | 62.693 64.742 64.524 63.578 62.416 61.9
PN+V 44.411 47.067 47.936 48.304 47.778 48.067 | 64.813 65.033 66.158 65.37 64.318 64.82
PN+R 47.849 50.309 52.631 52.607 52.14 51.996 | 66.758 70.831 70.771 70.447 71.147 62.733
PN+T 43.942 45.944 47.263 48.022 48.122 48.011 | 62.396 63.316 64.342 63.024 63.531 64.86
PN+V+R 49.322 51.057 51.031 52.782 52.716 51.773 | 69.1 70.936 71.496 71.36 70.36 68.23
PN+V+T 45.689 47.927 48.422 48.002 47.693 47.947 | 61.667 63.484 63.736 62.431 61.978 63.48
PN+R+T 46.913 51.224 52.338 53.036 53.789 53.66 | 68.76 71.38 72.34 72.151 72.584 67.34
TABLE V: Ablative study of our approach on the miniImagenet dataset. Averaged accuracy is reported as the training way is varied. Ablations include the Variance estimator (V), Relative features (R), and Category-agnostic Transformer (T). The baseline is the Prototypical Network (PN)

IV-D Ablation Study with Varying Training and Testing Conditions

The contribution of this paper consists of the following modules on top of the Prototypical Network (PN): a variance estimator (V), the relative features (R), and the category-agnostic transformer (T). We thus performed an ablative study, where we added all combinations of these modules to the PN and observed the change in performance. The results of this experiment are reported in Table V as the training way is varied for the 5-way 1-shot and 5-way 5-shot testing conditions.

We used our own implementation of PN in this experiment and the following experiments. Table V reveals that the addition of the relative features (R) has the most significant effect on the performance, followed by the variance estimator (V) and the category-agnostic transformer (T). This is because the relative features reduce the gap between the feature dimensionality and the number of samples, and thus alleviate overfitting. On the other hand, PN+T has negligible improvement or slightly worse performance compared to the PN baseline. This is because prototypical networks tend to cluster same-class samples very close to one another, so the additional transformation stage (T) that maps samples to prototypes might be redundant. In certain cases, the complex non-linear transformation might even over-fit and produce worse performance. It should be noted that higher ways in training do not always produce better performance. For example, in 5-way 1-shot testing, PN+R produced a peak in performance for the 15-way training strategy, with a dip in performance on either side. A similar pattern can be observed for the 5-way 5-shot testing results. The effect of relative features is also significant when pairs of modules are added to the PN baseline. In Table V, we can see that PN+V+R and PN+R+T reached accuracy levels over 50% and 70% for the 5-way 1-shot and 5-way 5-shot testing cases, respectively, but PN+V+T failed to do so. An interesting observation is that the combination R+T mostly provided better performance than V+R, even though V alone provided better performance than T alone. This suggests that adding modules to the PN baseline did not always produce purely additive effects; the modules also interacted with one another.

IV-E Parameter Sensitivity Studies

We also performed experiments to find how the performance of PN+R varied with changing $\lambda$. The results are shown in Fig. 8 for both the 5-way 1-shot and 5-way 5-shot testing conditions. The training condition for 5-way 1-shot testing is 20-way 1-shot, and that for 5-way 5-shot testing is 20-way 5-shot. The PN baseline is shown using the dotted line. The plot shows that the accuracy followed a bell curve, with the maximum accuracy observed at an intermediate value of $\lambda$. It is recommended not to use too large a value of $\lambda$, as it caused a degradation in performance that was sometimes worse than the PN baseline. This is because putting excess weight on the relative features diminishes the effect of the absolute features, which are crucial for recognition.

Fig. 8: Plot of accuracy with respect to $\lambda$ for 5-way 1-shot (5w1s) and 5-way 5-shot (5w5s) testing conditions with the prototypical network baseline. The dataset used is miniImagenet.

We also studied the effect of changing $\epsilon$ and $\eta$ on the recognition performance for different testing shots. In Fig. 9, we see that the performance varied for different thresholds, with a peak performance obtained for a value of $\epsilon$ between 0 and 1. In fact, for the higher-shot configuration, the peak performance was obtained at a higher threshold. This is because, for higher shots, the contribution of the few-shot sample mean is much larger than the contribution of the base categories. As a result, a higher threshold was required to reduce the contribution of the base classes.

Fig. 9: Plot of accuracy with respect to $\epsilon$ for 5-way 1-shot (5w1s) and 5-way 5-shot (5w5s) testing conditions with the prototypical network baseline. The dataset used is miniImagenet.

In Fig. 10, we observed how the recognition performance changed as $\eta$ was varied for different shots. As expected, the peak performance was better than the baseline shown in dashed lines. However, the sensitivity in the 5-shot configuration was less than that in the 1-shot configuration. This is because, for higher shots, the constraint corresponding to $\eta$ – that the sample mean should be close to the prototype – is automatically satisfied, and therefore changing the value of $\eta$ did not change the performance much.

We carried out additional sensitivity studies of $\epsilon$ and $\eta$ over a smaller range of values. The results are reported in Tables VI and VII for $\epsilon$ and $\eta$, respectively. The results show very little change when the parameters are varied over such a small range. However, the response was oscillatory, probably because of the non-convexity of the loss functions used in our framework.

Fig. 10: Plot of accuracy with respect to $\eta$ for 5-way 1-shot (5w1s) and 5-way 5-shot (5w5s) testing conditions. The dataset used was miniImagenet.
$\epsilon$ 0.02 0.04 0.06 0.08 0.1
5-way 1-shot 48.01 48.23 48.11 48.34 48.66
5-way 5-shot 62.39 62.31 62.54 62.51 62.73
TABLE VI: Performance sensitivity with respect to the threshold $\epsilon$ over a small range. The dataset used is miniImagenet.
$\eta$ 1e-4 2e-4 4e-4 8e-4
5-way 1-shot 49.51 50.64 50.50 50.78
5-way 5-shot 63.11 63.26 63.36 63.34
TABLE VII: Performance sensitivity with respect to $\eta$ over a small range. The dataset used is miniImagenet.

IV-F Feature Visualization

We also visualized the features in two dimensions using t-SNE [34] as shown in Fig. 11. From Fig. 11(a), it is clear that PN produced a very compact feature space, where the classes were very difficult to distinguish. On the other hand, the features obtained using PN+R+V as shown in Fig. 11(b) were more distinguishable class-wise. This resulted in better recognition performance.

It is important to note that removing the outlier from Fig. 11(a) and rescaling the figure would make it look similar to Fig. 11(b). This is the key difference between the Prototypical Network (PN) and our PN+R+V method: PN produced more scaled-down features that lie closer to one another, making classification more difficult than with PN+R+V. However, distinguishing classes visually was complicated in both cases, which is why we used the Euclidean-distance-based differentiable nearest-neighbor classifier.

Fig. 11: t-SNE plots for (a) PN and (b) PN+R+V. The dataset used is miniImagenet. The same color corresponds to different samples of the same category.
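
A visualization along the lines of Fig. 11 can be produced with off-the-shelf t-SNE [34]; in the sketch below, the feature and label arrays are random placeholders standing in for the embeddings produced by PN or PN+R+V.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

features = np.random.randn(300, 64)          # placeholder episode embeddings
labels = np.random.randint(0, 5, size=300)   # placeholder class labels

emb = TSNE(n_components=2, perplexity=30, init="pca",
           random_state=0).fit_transform(features)
plt.scatter(emb[:, 0], emb[:, 1], c=labels, cmap="tab10", s=10)
plt.show()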

IV-G Convergence Results

We also report the training and testing performance as the number of training episodes increases in Fig. 12. We used the 20-way 5-shot setting for training and the 5-way 5-shot setting for testing. As shown in Fig. 12, the test accuracy for PN+V+R rose faster than that of PN. The training accuracy was quite noisy, because each training episode drew a new set of categories, which produced high variance in the per-episode accuracy.

Fig. 12: Training and test accuracy with increasing number of episodes for the prototypical network (PN) baseline and our proposed approach using relative features and variance estimator (PN+V+R).
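
Because every episode draws a fresh set of categories, the raw per-episode training accuracy fluctuates heavily; a simple running mean makes the trend visible for plotting. The per-episode accuracies below are simulated for illustration.

import numpy as np

def smooth(values, window=100):
    # Running mean over episodes to damp the episode-to-episode variance.
    kernel = np.ones(window) / window
    return np.convolve(np.asarray(values, dtype=float), kernel, mode="valid")

raw = 0.6 + 0.1 * np.random.randn(5000)   # simulated noisy per-episode accuracies
trend = smooth(raw)                       # smoothed curve for plotting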

IV-H Effect of Number of Samples

Since the relative features are constructed from both the support and query points, it is worth examining how recognition performance changes with the number of query points per class in the training and testing stages. We performed two experiments for the PN+R case. In the first, the number of training query points per class was fixed at 15 and the number of test query points was varied. In the second, the number of test query points per class was fixed at 15 and the number of training query points was varied. Fig. 13 shows that as the number of query points increased, the recognition performance improved and then saturated: beyond a certain quantity, additional query points provide no further second-order structural information. Also, the poor performance with a single test query sample shows that having sufficient query samples at test time matters more than having a large quantity of query samples during training.

Fig. 13: Accuracy when the number of training query points is fixed and the number of test query points is varied, and vice versa. The dataset used is miniImagenet.
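
For reference, the episode construction varied in this experiment can be sketched as follows; the dataset layout (a map from class label to an array of feature vectors) is an assumption made for illustration.

import numpy as np

def sample_episode(dataset, n_way=5, k_shot=5, q_query=15, rng=None):
    # dataset: dict mapping class label -> (m, d) array of feature vectors.
    rng = rng or np.random.default_rng()
    classes = rng.choice(list(dataset.keys()), size=n_way, replace=False)
    support, query = [], []
    for c in classes:
        idx = rng.permutation(len(dataset[c]))[: k_shot + q_query]
        support.append(dataset[c][idx[:k_shot]])
        query.append(dataset[c][idx[k_shot:]])
    return np.stack(support), np.stack(query)  # shapes (N, K, d) and (N, Q, d)

toy = {c: np.random.randn(30, 64) for c in range(20)}  # 20 toy classes
S, Q = sample_episode(toy)  # a 5-way 5-shot episode with 15 queries per class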

IV-I Effect of Base Categories

We also evaluated how the performance of PN+T varied with the number of source base categories. Results are shown in Table VIII. The recognition performance increased with the number of source categories, because more source categories train a more robust feature space and raise the probability of finding relevant categories for the category-agnostic transformation stage. At higher shots, the category-agnostic transformation performed worse relative to PN, because the transformation becomes closer to the identity and its significance diminishes.

So far, we have tested our proposed approach on the novel categories. It is also important to test it on the base categories, since they are more common and are likely to be observed more frequently than novel categories. The results of applying our approach to the base categories are shown in Table IX for different testing settings. As expected, the performance on base categories was better than on novel categories. Furthermore, our proposed approach (PN+V+R) produced better results than PN.

No. of source categories   20      30      40      50      60
5-way 1-shot (PN)          40.10   42.196  44.14   45.66   45.74
5-way 1-shot (PN+T)        41.61   43.74   45.48   46.82   47.20
5-way 5-shot (PN)          45.89   51.93   55.95   59.796  60.96
5-way 5-shot (PN+T)        43.89   50.93   55.85   59.70   61.14
TABLE VIII: Performance analysis as the number of base categories is varied for the PN+T case. The dataset used is miniImagenet.
Method    5-way 1-shot   5-way 5-shot
PN        58.409         82.236
PN+V+R    64.111         85.293
TABLE IX: Performance comparison when testing on the base training classes.

IV-J Analysis of Category-agnostic Transformation

We also carried out an ablation analysis of PN+T, that is, the addition of the category-agnostic transformer (T) on top of the prototypical network baseline (PN). As described previously, the category-agnostic transformer consists of three modules: the neural-network-based transformer, the residual connection, and the contribution of the base prototypes. Table X shows that adding these modules one by one gradually improved the recognition performance, suggesting that all three are important. The full method with the base-prototype contribution used a tuned threshold; note that omitting the base-prototype module is equivalent to setting the threshold so high that no base category is selected. We also ran the full method with the threshold set so low that every base category contributes. The resulting accuracies on the 5-way 1-shot and 5-way 5-shot tasks were worse than with the tuned threshold, because without thresholding all base classes, including irrelevant ones, contribute to the category-agnostic transformation, causing negative transfer.

Method                                          5-way 1-shot   5-way 5-shot
PN + transformer                                47.393         62.411
PN + transformer + residual                     48.604         62.683
PN + transformer + residual + base prototypes   49.002         63.024
TABLE X: Ablation analysis of each component of the category-agnostic transformer. The dataset used is miniImagenet.
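
A hedged sketch of how these three modules might compose is given below. The additive combination, the identity stand-in for the transformer network, and the renormalization of the thresholded probabilities are all assumptions; the paper's exact formulation is given in the method section.

import numpy as np

def category_agnostic_prototype(sample_mean, base_protos, p, tau=0.1,
                                transform=lambda x: x):
    mapped = transform(sample_mean)             # neural-network-based transformer
    residual = sample_mean                      # residual (identity) connection
    mask = p > tau                              # keep only relevant base classes
    w = (p * mask) / (np.sum(p * mask) + 1e-8)  # renormalize the survivors
    base = w @ base_protos                      # weighted base-prototype term
    return mapped + residual + base

mean = np.random.randn(64)            # novel-class sample mean
bases = np.random.randn(60, 64)       # 60 base-class prototypes
p = np.random.dirichlet(np.ones(60))  # relevance probabilities over base classes
proto = category_agnostic_prototype(mean, bases, p, tau=0.02)

A zero threshold lets every base class contribute (the negative-transfer case above), while a threshold above the largest probability removes the base term entirely, recovering the variant without the base-prototype module.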

The category-agnostic transformer includes a contribution from the base categories, as described earlier. Through the threshold mechanism, only relevant base categories are selected to contribute, since they lie closer to the novel category in the feature space than the irrelevant ones. Using the thresholded probability vector, we selected the top three relevant base categories for a few novel categories; the results are shown in Table XI. For example, all the top relevant categories for the African hunting dog have canine features. The relevant categories for the mixing bowl fit its context: pictures of consomme and hotdogs are generally shown in plates or bowls. Likewise, the relevant categories for the nematode, a worm-like organism, include a snake and insects. There were also erroneous selections, such as the harvestman spider being the most relevant category for the golden retriever. This suggests that an additional class-relevance criterion based on WordNet [37] might be more appropriate.

Novel class           Rank 1        Rank 2             Rank 3
African hunting dog   Saluki        Arctic fox         Komondor
Mixing bowl           Consomme      Hotdog             Ear
Golden-retriever      Harvestman    Miniature poodle   Bolete
Nematode              Green-mamba   Lady-bug           Spider-web
TABLE XI: Novel categories and their top three relevant base categories.
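
As an illustration of how such a ranking could be computed, the sketch below scores each base prototype by its distance to the novel-class sample mean, forms a softmax probability vector, thresholds it, and keeps the top three survivors. The distance-based scoring and the names are assumptions for illustration.

import numpy as np

def top_relevant_bases(sample_mean, base_protos, base_names, tau=0.02, k=3):
    s = -np.linalg.norm(base_protos - sample_mean, axis=1)  # closer => higher score
    p = np.exp(s - s.max())
    p = p / p.sum()                    # softmax probabilities over base classes
    p = np.where(p > tau, p, 0.0)      # thresholded probability vector
    order = np.argsort(p)[::-1][:k]    # indices of the k largest entries
    return [(base_names[i], float(p[i])) for i in order if p[i] > 0]

names = ["base_%d" % i for i in range(60)]
ranked = top_relevant_bases(np.random.randn(64), np.random.randn(60, 64), names)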

V Conclusions

We have proposed a two-stage framework for few-shot image recognition, with contributions at both the feature-extraction and the classification stages. At the feature-extraction stage, we proposed the relative-feature representation as well as a Mahalanobis distance metric with predicted variance. For the classification stage, we proposed a category-agnostic transformation that produces class prototypes from class samples. Results on standard few-shot learning datasets showed our approach to be comparable to, or better than, previous approaches. Further analysis of our model indicated that the relative-feature component was chiefly responsible for the performance improvement. In the future, we would like to extend our work to zero-shot classification, where there are no support samples from the novel class, only high-level semantic information for each class.
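
As a final illustration of the classification rule summarized above, the minimal sketch below performs a nearest-prototype decision under a per-class diagonal variance; with unit variances it reduces to the Euclidean rule used by earlier approaches. The diagonal form and the variance values are assumptions made for illustration.

import numpy as np

def classify(query, prototypes, variances):
    # Mahalanobis distance with a per-class diagonal covariance: each
    # feature dimension is rescaled by that class's predicted variance.
    diff = prototypes - query
    maha = np.sum(diff ** 2 / (variances + 1e-8), axis=1)
    return int(np.argmin(maha))        # nearest prototype wins

protos = np.random.randn(5, 64)                # 5 class prototypes, 64-d
vars_ = np.abs(np.random.randn(5, 64)) + 0.5   # predicted positive variances
label = classify(np.random.randn(64), protos, vars_)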

References

  • [1] M. Andrychowicz, M. Denil, S. Gomez, M. W. Hoffman, D. Pfau, T. Schaul, and N. de Freitas (2016) Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems, pp. 3981–3989. Cited by: §II.
  • [2] L. Bertinetto, J. F. Henriques, P. H. Torr, and A. Vedaldi (2019) Meta-learning with differentiable closed-form solvers. In Intern. Conf. Learn. Repr., Cited by: §II, TABLE II, TABLE IV.
  • [3] L. Bertinetto, J. F. Henriques, J. Valmadre, P. Torr, and A. Vedaldi (2016) Learning feed-forward one-shot learners. In Advan. Neu. Inf. Proc. Syst., pp. 523–531. Cited by: §IV-A.
  • [4] O. Chapelle, B. Scholkopf, and A. Zien (2009) Semi-supervised learning (O. Chapelle, B. Scholkopf, and A. Zien, eds.; 2006)[book reviews]. IEEE Transactions on Neural Networks 20 (3), pp. 542–542. Cited by: §II.
  • [5] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman (2014) Return of the devil in the details: delving deep into convolutional nets. arXiv preprint arXiv:1405.3531. Cited by: §II.
  • [6] D. Das and C. S. G. Lee (2018) Graph matching and pseudo-label guided deep unsupervised domain adaptation. In Proceedings of the International Conference on Artificial Neural Networks, Cited by: §III-A.
  • [7] D. Das and C. S. G. Lee (2018) Sample-to-sample correspondence for unsupervised domain adaptation. Engineering Applications of Artificial Intelligence 73, pp. 80–91. Cited by: §III-A.
  • [8] D. Das and C. S. G. Lee (2018) Unsupervised domain adaptation using regularized hyper-graph matching. In IEEE Intern. Conf. Image Processing, pp. 3758–3762. Cited by: §III-A.
  • [9] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In Proc. IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), pp. 248–255. Cited by: §I.
  • [10] E. L. Denton, S. Chintala, and R. Fergus (2015) Deep generative image models using a Laplacian pyramid of adversarial networks. In Advances in Neural Information Processing Systems, pp. 1486–1494. Cited by: §II.
  • [11] A. Dosovitskiy, J. T. Springenberg, M. Riedmiller, and T. Brox (2014) Discriminative unsupervised feature learning with convolutional neural networks. In Advan. Neu. Inf. Proc. Syst., pp. 766–774. Cited by: §II.
  • [12] H. Edwards and A. Storkey (2017) Towards a neural statistician. In Intern. Conf. Learning Representations, Cited by: §IV-C, TABLE I.
  • [13] L. Fei-Fei, R. Fergus, and P. Perona (2006) One-shot learning of object categories. IEEE Trans. Pattern Anal. Mach. Intell. 28 (4), pp. 594–611. Cited by: §II.
  • [14] C. Finn, P. Abbeel, and S. Levine (2017) Model-agnostic meta-learning for fast adaptation of deep networks. arXiv preprint arXiv:1703.03400. Cited by: §II, §IV-C, TABLE I, TABLE II, TABLE III, TABLE IV.
  • [15] H. Gao, Z. Shou, A. Zareian, H. Zhang, and S. Chang (2018) Low-shot learning via covariance-preserving adversarial augmentation networks. In Advan. Neu. Inf. Proc. Syst., pp. 975–985. Cited by: §II.
  • [16] V. Garcia and J. Bruna (2018) Few-shot learning with graph neural networks. In Intern. Conf. Learn. Repr., Cited by: §II, TABLE II, TABLE IV.
  • [17] S. Gidaris and N. Komodakis (2018) Dynamic few-shot visual learning without forgetting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §III-A.
  • [18] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680. Cited by: §II.
  • [19] B. Hariharan and R. Girshick (2017) Low-shot visual recognition by shrinking and hallucinating features. In Proc. of IEEE Int. Conf. on Computer Vision (ICCV), Venice, Italy, Cited by: §II.
  • [20] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proc. IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), pp. 770–778. Cited by: §III-D, §IV-B.
  • [21] N. Hilliard, L. Phillips, S. Howland, A. Yankov, C. D. Corley, and N. O. Hodas (2018) Few-shot learning with metric-agnostic conditional embeddings. arXiv preprint arXiv:1802.04376. Cited by: §IV-A.
  • [22] S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §II.
  • [23] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In Intern. Conf. Mach. Learn., pp. 448–456. Cited by: §IV-B.
  • [24] L. Kaiser, O. Nachum, A. Roy, and S. Bengio (2017) Learning to remember rare events. In Intern. Conf. Learning Representations, Cited by: §IV-C, TABLE I.
  • [25] D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In Intern. Conf. Learning Representations, Cited by: §IV-B.
  • [26] G. Koch, R. Zemel, and R. Salakhutdinov (2015) Siamese neural networks for one-shot image recognition. In ICML Deep Learning Workshop, Vol. 2. Cited by: §II, §IV-C, TABLE I.
  • [27] A. Krizhevsky, G. Hinton, et al. (2009) Learning multiple layers of features from tiny images. Technical report, Univ. of Toronto. Cited by: §IV-A.
  • [28] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In Advan. Neu. Inf. Proc. Syst., pp. 1097–1105. Cited by: §I.
  • [29] B. M. Lake, R. Salakhutdinov, J. Gross, and J. B. Tenenbaum (2011) One shot learning of simple visual concepts. In Proc. Annual Conf. of the Cognitive Science Society, pp. 2. Cited by: §II, §IV-A.
  • [30] B. M. Lake, R. R. Salakhutdinov, and J. Tenenbaum (2013) One-shot learning by inverting a compositional causal process. In Advan. Neu. Inf. Proc. Syst., pp. 2526–2534. Cited by: §II.
  • [31] B. M. Lake, R. Salakhutdinov, and J. B. Tenenbaum (2015) Human-level concept learning through probabilistic program induction. Science 350 (6266), pp. 1332–1338. Cited by: §II.
  • [32] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: §I.
  • [33] Y. Liu, J. Lee, M. Park, S. Kim, E. Yang, S. J. Hwang, and Y. Yang (2019) Learning to propagate labels: transductive propagation network for few-shot learning. In Intern. Conf. Learn. Repr., Cited by: §II, §IV-C, TABLE II.
  • [34] L. v. d. Maaten and G. Hinton (2008) Visualizing data using t-SNE. Journal of Machine Learning Research 9 (Nov), pp. 2579–2605. Cited by: §IV-F.
  • [35] A. Mehrotra and A. Dukkipati (2017) Generative adversarial residual pairwise networks for one shot learning. arXiv preprint arXiv:1703.08033. Cited by: §II.
  • [36] T. Mensink, J. Verbeek, F. Perronnin, and G. Csurka (2013) Distance-based image classification: generalizing to new classes at near-zero cost. IEEE transactions on pattern analysis and machine intelligence 35 (11), pp. 2624–2637. Cited by: §II, §III-C.
  • [37] G. A. Miller (1995) WordNet: a lexical database for english. Communications of the ACM 38 (11), pp. 39–41. Cited by: §IV-J.
  • [38] N. Mishra, M. Rohaninejad, X. Chen, and P. Abbeel (2017) A simple neural attentive meta-learner. In NIPS 2017 Workshop on Meta-Learning, Cited by: §II.
  • [39] T. Munkhdalai and H. Yu (2017) Meta networks. In International Conference on Machine Learning, pp. 2554–2563. Cited by: §II, §IV-C, TABLE I, TABLE II.
  • [40] T. Munkhdalai, X. Yuan, S. Mehri, and A. Trischler (2018) Rapid adaptation with conditionally shifted neurons. In Intern. Conf. Mach. Learn., Cited by: §II.
  • [41] A. Nichol, J. Achiam, and J. Schulman (2018) On first-order meta-learning algorithms. arXiv preprint arXiv:1803.02999. Cited by: TABLE II.
  • [42] M. Oquab, L. Bottou, I. Laptev, and J. Sivic (2014) Learning and transferring mid-level image representations using convolutional neural networks. In Proc. IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), pp. 1717–1724. Cited by: §III-A.
  • [43] B. Oreshkin, P. R. López, and A. Lacoste (2018) Tadam: task dependent adaptive metric for improved few-shot learning. In Advan. Neu. Inf. Proc. Syst., pp. 721–731. Cited by: §II.
  • [44] S. J. Pan and Q. Yang (2010) A survey on transfer learning. IEEE Trans. Knowledge Data Engg. 22 (10), pp. 1345–1359. Cited by: §III-D.
  • [45] H. Qi, M. Brown, and D. G. Lowe (2018) Low-shot learning with imprinted weights. In Proc. IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), pp. 5822–5830. Cited by: §II.
  • [46] S. Qiao, C. Liu, W. Shen, and A. L. Yuille (2018) Few-shot image recognition by predicting parameters from activations. In Proc. IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), pp. 7229–7238. Cited by: §II, §IV-C, TABLE II.
  • [47] T. Ramalho and M. Garnelo (2019) Adaptive posterior learning: few-shot learning with a surprise-based memory module. In Intern. Conf. Learn. Repr., Cited by: §II.
  • [48] S. Ravi and H. Larochelle (2017) Optimization as a model for few-shot learning. In International Conference on Learning Representations, Cited by: §II, §IV-C, TABLE II, TABLE III.
  • [49] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. (2015) Imagenet large scale visual recognition challenge. Intern. J. of Computer Vision 115 (3), pp. 211–252. Cited by: §IV-A.
  • [50] R. Salakhutdinov, J. B. Tenenbaum, and A. Torralba (2012) One-shot learning with a hierarchical nonparametric bayesian model. In Proc. Intern. Conf. Machine Learning, pp. 195–206. Cited by: §II.
  • [51] A. Santoro, S. Bartunov, M. Botvinick, D. Wierstra, and T. Lillicrap (2016) Meta-learning with memory-augmented neural networks. In International Conference on Machine Learning, pp. 1842–1850. Cited by: §II, §IV-C, TABLE I.
  • [52] E. Schwartz, L. Karlinsky, J. Shtok, S. Harary, M. Marder, A. Kumar, R. Feris, R. Giryes, and A. Bronstein (2018) Delta-encoder: an effective sample synthesis method for few-shot object recognition. In Advan. Neu. Inf. Proc. Syst., pp. 2845–2855. Cited by: §II.
  • [53] J. Shu, Z. Xu, and D. Meng (2018) Small sample learning in big data era. arXiv preprint arXiv:1808.04572. Cited by: §II.
  • [54] P. Shyam, S. Gupta, and A. Dukkipati (2017) Attentive recurrent comparators. In International Conference on Machine Learning, pp. 3173–3181. Cited by: §II.
  • [55] J. Snell, K. Swersky, and R. Zemel (2017) Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pp. 4080–4090. Cited by: §II, §III-C, §IV-B, §IV-C, TABLE I, TABLE II, TABLE III, TABLE IV.
  • [56] Q. Sun, Y. Liu, T. Chua, and B. Schiele (2019-06) Meta-transfer learning for few-shot learning. In Proc. IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), Cited by: §II.
  • [57] F. Sung, Y. Yang, L. Zhang, T. Xiang, P. H. Torr, and T. M. Hospedales (2018) Learning to compare: relation network for few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §II, §IV-C, §IV-C, TABLE I, TABLE II, TABLE IV.
  • [58] E. Triantafillou, R. Zemel, and R. Urtasun (2017) Few-shot learning through an information retrieval lens. In Advan. Neu. Inf. Proc. Syst., pp. 2255–2265. Cited by: §II.
  • [59] O. Vinyals, C. Blundell, T. Lillicrap, K. Kavukcuoglu, and D. Wierstra (2016) Matching networks for one shot learning. In Advan. Neu. Inf. Proc. Syst., pp. 3630–3638. Cited by: §II, §III-A, §IV-A, §IV-B, §IV-C, TABLE I, TABLE II, TABLE III.
  • [60] Y. Wang and Q. Yao (2019) Few-shot learning: a survey. arXiv preprint arXiv:1904.05046. Cited by: §II.
  • [61] Y. Wang, R. Girshick, M. Herbert, and B. Hariharan (2018) Low-shot learning from imaginary data. In Computer Vision and Pattern Recognition (CVPR), Cited by: §II.
  • [62] Y. Wang and M. Hebert (2016) Learning to learn: model regression networks for easy small sample learning. In European Conference on Computer Vision, pp. 616–634. Cited by: §III-A, §III-D.
  • [63] P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, and P. Perona (2010) Caltech-UCSD Birds 200. Technical report Technical Report CNS-TR-2010-001, California Institute of Technology. Cited by: §IV-A.
  • [64] A. Wong and A. L. Yuille (2015) One shot learning via compositions of meaningful patches. In Proc. IEEE Int. Conf. Comput. Vis., pp. 1197–1205. Cited by: §II.
  • [65] R. Zhang, T. Che, Z. Ghahramani, Y. Bengio, and Y. Song (2018) Metagan: an adversarial approach to few-shot learning. In Advan. Neu. Inf. Proc. Syst., pp. 2365–2374. Cited by: §II.
  • [66] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva (2014) Learning deep features for scene recognition using places database. In Advances in neural information processing systems, pp. 487–495. Cited by: §I.
  • [67] X. Zhu, D. Anguelov, and D. Ramanan (2014) Capturing long-tail distributions of object subcategories. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pp. 915–922. Cited by: §I.
  • [68] X. Zhu, C. Vondrick, C. C. Fowlkes, and D. Ramanan (2015) Do we need more training data?. Intern. J. of Computer Vision, pp. 1–17. Cited by: §II.