Variational Transfer Learning for Fine-grained Few-shot Visual Recognition

10/07/2020 ∙ by Jingyi Xu, et al. ∙ Stony Brook University

Fine-grained few-shot recognition often suffers from the problem of training data scarcity for novel categories. The network tends to overfit and does not generalize well to unseen classes due to insufficient training data. Many methods have been proposed to synthesize additional data to support the training. In this paper, we focus on enlarging the intra-class variance of the unseen class to improve few-shot classification performance. We assume that the distribution of intra-class variance generalizes across the base class and the novel class. Thus, the intra-class variance of the base set can be transferred to the novel set for feature augmentation. Specifically, we first model the distribution of intra-class variance on the base set via variational inference. Then the learned distribution is transferred to the novel set to generate additional features, which are used together with the original ones to train a classifier. Experimental results show a significant boost over the state-of-the-art methods on the challenging fine-grained few-shot image classification benchmarks.





Fine-grained visual recognition aims to classify images belonging to the same or closely-related categories, where the features to discriminate one class from the others are often subtle and in fine detail. Thus, training a fine-grained classification model often requires a large amount of training data. However, fine-grained labeled data is often scarce in real-world scenarios due to the high annotation cost. A solution to alleviate this data dependency issue is few-shot learning, which aims to recognize a set of classes given access to only a few training examples.

Since only a few training examples are available for novel classes, the learned classifier tends to overfit the few given samples and cannot generalize well to unseen data (see Fig. 1a and 1b). To address this issue, numerous data augmentation and synthesis methods [Schwartz et al.2018, Tsutsui, Fu, and Crandall2019] have been proposed to increase the diversity of the available training data from the novel classes and better approximate the true data distribution (Fig. 1c). For example, Tsutsui et al. [Tsutsui, Fu, and Crandall2019] use generative adversarial networks (GANs) to generate synthetic images and combine them with the original images to train one-shot classifiers. However, these generated images are constrained to be similar to the real ones and thus cannot introduce diverse examples beyond the original images. Instead of generating images directly, Schwartz et al. [Schwartz et al.2018] propose to model the transformations between pairs of examples from the same class with an auto-encoder, which can then be applied to the few novel class examples to synthesize samples. However, they do not take into account the overall distribution of the generated data, as they mainly focus on modeling the transferable intra-class deformations.

One of the main challenges of few-shot learning is the low intra-class variance for the unseen classes due to the lack of training instances. Low intra-class variance makes it harder to identify the discriminative features for each class. For any given feature, it is challenging to determine if it is common to all classes or is representative for the particular class. For example, an unseen class of a bird species might only contain examples of the bird flying in the blue sky background. With only those training examples, the model can simply learn to detect the blue sky background and the flying action as the representative features of the class.

Figure 1: Decision boundaries learned from different sets of training samples. (a) Full training samples. (b) Insufficient training samples. (c) Insufficient training samples and additional augmented feature representations. The lack of training data leads to a biased decision boundary. Additional training features aim to correct this bias. (Best viewed in color.)

In this paper, we propose a method to increase the intra-class variance of the novel set. We observe that certain modes of intra-class variation generalize across categories, especially for fine-grained visual classification where the inter-class distance is relatively small. For example, a base class in the CUB dataset [Welinder et al.2010] includes images of a bird species with multiple viewpoints, background types, or actions. These intra-class variations are relatively common across all classes of the CUB dataset, including the “unseen” ones. Hence, we propose to learn a distribution that models the intra-class variance from the base class and use it to diversify the unseen class.

In particular, we first train a variational inference model in which each data point in the embedding space can be decomposed into an intra-class invariance component and an intra-class variance component. Such a model can be trained using base class data points, with the intra-class variance part forced to follow an isotropic multivariate Gaussian distribution. Once this intra-class variance distribution is learned, we use it to generate intra-class variance vectors and synthesize additional features for the unseen class. Specifically, for images in the novel set, we first extract the class-specific features using the trained variational inference model. Then we sample features repeatedly from the learned intra-class variance distribution and add them to the class-specific features to get augmented features. The classifier is trained with both the original and augmented features.

To summarize, the contributions of the work are:

(1) We learn the distribution of intra-class variance by variational inference on the base set in an end-to-end training manner.

(2) We demonstrate that the learned distribution can be transferred to novel set images effectively to generate more diverse features which lead to a more robust classifier.

(3) Experimental results on fine-grained visual recognition benchmarks show that our method boosts the performance of existing state-of-the-art few-shot learning methods.

Related Work

Few-shot Learning

Few-shot learning aims to recognize an image of a novel class with very few labeled examples available. The main algorithms of few-shot classification can be broadly organized into three categories: metric learning based, meta-learning based and data augmentation based.

Metric learning based methods attempt to measure the similarities between images by mapping them into a common embedding space. The derived embedding space is expected to be discriminative, i.e., feature representations from the same object class are closer together. Vinyals et al. [Vinyals et al.2016] propose Matching Networks, which use an attention mechanism over a learned embedding of the labeled set of examples to predict classes for the unlabeled points. Prototypical Network [Snell, Swersky, and Zemel2017] learns to classify query samples based on their Euclidean distance to prototype representations of each class. Sung et al. [Sung et al.2018] propose to measure the distance metric with a CNN-based relation module.
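As an illustration of the metric learning idea, the following is a minimal NumPy sketch of nearest-prototype classification in the spirit of Prototypical Networks. It is not the reference implementation: the function names and the toy 2-D embeddings are assumptions made for this example, and the embedding network itself is omitted.

```python
import numpy as np

def prototypes(support_feats, support_labels, n_classes):
    """Mean embedding of the support examples of each class."""
    return np.stack([
        support_feats[support_labels == c].mean(axis=0)
        for c in range(n_classes)
    ])

def classify_by_prototype(query_feats, protos):
    """Assign each query to the class whose prototype is nearest (Euclidean)."""
    # Pairwise squared distances, shape (n_query, n_classes).
    diffs = query_feats[:, None, :] - protos[None, :, :]
    sq_dists = (diffs ** 2).sum(axis=-1)
    return sq_dists.argmin(axis=1)

# Toy 2-way, 2-shot episode in a 2-D embedding space (made-up numbers).
support = np.array([[0.0, 0.0], [0.2, 0.0], [1.0, 1.0], [1.0, 0.8]])
labels = np.array([0, 0, 1, 1])
protos = prototypes(support, labels, n_classes=2)
preds = classify_by_prototype(np.array([[0.1, 0.1], [0.9, 1.0]]), protos)
```

With only one or a few support examples per class, each prototype is a noisy estimate of the class mean, which is the weakness that feature augmentation tries to address.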

Meta-learning based methods intend to design models which generalize to new tasks rapidly and efficiently. MAML [Finn, Abbeel, and Levine2017] uses a meta-learner to find an initialization which can be adapted to new categories within a few gradient updates using a small amount of training data. Meta-SGD [Li et al.2017] learns not only the learner initialization but also the learner update direction and learning rate. Lee et al. propose MetaOptNet [Lee et al.2019], which uses discriminatively trained linear predictors as base learners to learn feature representations for few-shot learning.

Data augmentation based methods aim to generate additional training examples to alleviate the problem of data insufficiency. DAGAN [Antreas Antoniou2018] uses a conditional generative adversarial network (GAN) to transfer the style, which can enhance standard vanilla classifiers as well as few-shot learning systems. Wang et al. [Wang et al.2018] propose to combine a meta-learner with a hallucinator, which can hallucinate novel instances of new classes, and optimize both models jointly. The Delta-encoder [Schwartz et al.2018] trains an auto-encoder to model the transferable intra-class deformations from image pairs of the same class, which can then be applied to the few novel class examples to synthesize samples. However, its performance is heavily determined by the image pairs selected to learn such transformations. Our method follows this line of work, but synthesizes samples by sampling from a learned posterior distribution, avoiding the carefully designed data selection and model training process.

Transfer Learning

Transfer learning techniques try to transfer the knowledge from some previous tasks to a new task [Chen and Liu2013, Pan and Yang2009]. A key issue for transfer learning is “what to transfer”, i.e., which learnt components can be transferred across tasks. In [Yin et al.2019], the distribution of feature variance is transferred from regular classes to under-represented classes to address the issue of imbalanced training data. Sun et al. propose MTL [Sun et al.2018], which transfers pre-trained weights learned on large-scale datasets and aims to meta-learn how to transfer effectively. Hariharan et al. [Hariharan and Girshick2017] present a way of “hallucinating” additional examples for novel classes by transferring modes of variation from the base classes. They search the extracted feature vectors of base categories to collect all quadruplets of transformation “analogies” and use them to train a feature generator. Our variational transfer learning method leverages the idea of transferring the intra-class variance, which intuitively generalizes across classes. However, unlike [Hariharan and Girshick2017], which trains a feature generator separately, we model the intra-class variance distribution of base categories in an end-to-end training manner. By sampling from the distribution repeatedly, we get synthesized features for novel set images.

Variational Inference

Variational inference aims to estimate the true distribution of some latent variables in a probabilistic manner. It has been widely used in generative models [Kingma and Welling2014, Higgins et al.2017], e.g., variational autoencoders (VAEs), to produce diverse outputs. Recently, it has also been explored in discriminative models [Zhang et al.2019, Schonfeld et al.2019, Kim et al.2019] in few-shot scenarios as well as in metric learning problems [Lin et al.2018]. Zhang et al. [Zhang et al.2019] use variational inference to estimate a class-specific distribution and directly compute the probability of a novel input to perform classification. Schonfeld et al. [Schonfeld et al.2019] learn a shared latent space of image features and class embeddings via aligned variational autoencoders. The latent features contain the required discriminative information about the image and classes, which can then be used to train a softmax classifier. Instead of modeling discriminative features directly, we use variational inference to model the distribution of intra-class variance, from which multiple features can be sampled to augment the original embedding space.

Figure 2: The pipeline of our proposed method. In the training stage, given an input image from the base set, the output feature maps of the feature extractor are taken as the encoder’s input to model the intra-class variance $z_V$. Average pooling results over the feature maps are used to approximate the class-specific features $z_I$. The decoder takes the sum of $z_I$ and $z_V$ as input and is trained to reconstruct the original feature maps. Then in the fine-tuning stage, $z_I$ and the sum of $z_I$ and $z_V$ are combined together to train a linear classifier. Best viewed in color.

Few-shot Learning Preliminaries

In few-shot learning, abundant labeled images of base classes and a small number of labeled images of novel classes are given. Our goal is to train a classifier that can correctly classify novel class images with the few examples given. The standard procedure of few-shot learning includes two stages: the training stage and the fine-tuning stage. During the training stage, we use base class images to train a feature extractor and a classifier using the softmax cross-entropy loss. Then in the fine-tuning stage, we freeze the parameters of the pre-trained feature extractor and train a new classifier head using the few labeled examples of the novel classes. In the testing stage, the learned classifier predicts labels on a set of unseen novel class images.

The above is the simplest and most commonly used baseline approach for few-shot learning problems. However, since the samples available during the fine-tuning stage are scarce and lack diversity, the learned classifier tends to overfit to the few samples and thus performs poorly on testing images. In the following sections, we illustrate how we augment the training samples and significantly improve the performance of the baseline method.

Proposed Method

Variational Inference for Intra-class Variance

Our goal is to generate additional features of the few novel class images which contain larger intra-class variance during the fine-tuning stage. To achieve this, during the training stage, the model is not only trained to extract discriminative features for classification but also trained to model the distribution of intra-class variance. The learned distribution can be then transferred to the novel set for feature augmentation.

To supervise the learning of intra-class variance, we decompose the embedding feature $z$ of a given sample $x$ into two parts:

$$z = z_I + z_V \tag{1}$$

where $z_V$ represents the intra-class variance generated from a conditional distribution $p(z_V \mid x)$, and $z_I$ represents the class-specific feature of sample $x$ from class $c$. The image $x$ is generated from some conditional distribution $p(x \mid z_I, z_V)$.

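The learning of the class-specific feature $z_I$ can be achieved by minimizing the cross-entropy loss given the class label $c$:

$$\mathcal{L}_{cls} = -\log \frac{\exp\big(f_c(z_I)\big)}{\sum_{j=1}^{C} \exp\big(f_j(z_I)\big)} \tag{2}$$

where $f$ is the final fully connected layer for classification, $f_j(\cdot)$ is its output for class $j$, and $C$ is the total number of classes.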
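The classification loss described above is the usual softmax cross-entropy applied to the output of a linear layer. A minimal NumPy sketch follows; the names `z_i`, `weight` and `bias` are illustrative assumptions standing in for the class-specific feature and the parameters of the final fully connected layer.

```python
import numpy as np

def softmax_cross_entropy(logits, label):
    """-log softmax(logits)[label], computed in a numerically stable way."""
    shifted = logits - logits.max()              # avoid overflow in exp
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[label]

def classification_loss(z_i, weight, bias, label):
    """Cross-entropy of a linear layer applied to the class-specific feature."""
    logits = weight @ z_i + bias                 # shape (n_classes,)
    return softmax_cross_entropy(logits, label)

# Sanity check: with all-zero weights every class is equally likely,
# so the loss equals log(n_classes).
n_classes, dim = 5, 8
loss = classification_loss(np.ones(dim), np.zeros((n_classes, dim)),
                           np.zeros(n_classes), label=2)
```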

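The learning of the variable $z_V$ is achieved by variational inference, which provides a probabilistic way of describing a latent representation. By variational inference, we approximate the posterior distribution $p(z_V \mid x)$ with some other distribution $q(z_V \mid x)$, and the Kullback-Leibler divergence between the true distribution and the approximation is:

$$D_{KL}\big[q(z_V \mid x) \,\|\, p(z_V \mid x)\big] = \int q(z_V \mid x) \log \frac{q(z_V \mid x)}{p(z_V \mid x)} \, dz_V \tag{3}$$

Since the Kullback-Leibler divergence is always greater than or equal to zero, maximizing the marginal likelihood is equivalent to maximizing the evidence lower bound (ELBO) defined as follows:

$$\mathrm{ELBO} = \mathbb{E}_{q(z_V \mid x)}\big[\log p(x \mid z_I, z_V)\big] - D_{KL}\big[q(z_V \mid x) \,\|\, p(z_V)\big] \tag{4}$$

where the prior distribution of $z_V$ is set to be a centered isotropic multivariate Gaussian, $p(z_V) = \mathcal{N}(0, I)$. For the approximate posterior distribution, we set it to be a multivariate Gaussian with diagonal covariance:

$$q(z_V \mid x) = \mathcal{N}\big(z_V;\, \mu,\, \operatorname{diag}(\sigma^2)\big) \tag{5}$$

where $\mu$ and $\sigma$ are implemented via a probabilistic encoder. With the reparameterization trick, we have $z_V$ as follows:

$$z_V = \mu + \sigma \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I) \tag{6}$$

To maximize the likelihood term $\log p(x \mid z_I, z_V)$, we use a decoder to reconstruct the original samples from $z$ and minimize the distance between the original samples and the reconstructed ones.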
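The reparameterization step described above can be sketched in a few lines of NumPy. This is only an illustration: having the encoder output a log-variance (`log_var`) rather than the standard deviation directly is a common convention assumed here, and the numbers are made up.

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Draw z_v = mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    sigma = np.exp(0.5 * log_var)     # log-variance -> standard deviation
    return mu + sigma * eps

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0, 0.5])
log_var = np.log(np.array([0.25, 1.0, 4.0]))   # sigma = 0.5, 1.0, 2.0

# Many draws recover the requested mean and standard deviation.
samples = np.stack([reparameterize(mu, log_var, rng) for _ in range(20000)])
```

Because the randomness enters only through `eps`, the sample is a differentiable function of the encoder outputs, which is what allows the encoder to be trained by backpropagation in an automatic differentiation framework.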

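Now we can rewrite the loss for modeling intra-class variance according to Eq. (4) as follows:

$$\mathcal{L}_V = \|x - \hat{x}\|^2 + D_{KL}\big[q(z_V \mid x) \,\|\, p(z_V)\big] \tag{7}$$

where $\hat{x}$ is the reconstructed sample synthesized from the sum of the class-specific feature $z_I$ and the intra-class variance $z_V$, which is sampled from the distribution $\mathcal{N}\big(\mu, \operatorname{diag}(\sigma^2)\big)$.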
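For reference, when the prior is a standard normal and the approximate posterior is a Gaussian with diagonal covariance, mean $\mu$ and standard deviation $\sigma$, the Kullback-Leibler regularization term has a well-known closed form. Here $d$ denotes the latent dimension, a symbol introduced only for this note:

$$D_{KL}\big[\mathcal{N}(\mu, \operatorname{diag}(\sigma^2)) \,\|\, \mathcal{N}(0, I)\big] = \frac{1}{2} \sum_{j=1}^{d} \big(\mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1\big)$$

The expression is zero exactly when $\mu = 0$ and $\sigma = 1$ in every dimension, and grows as the posterior moves away from the prior.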

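The loss includes two terms. The first term, the reconstruction term, ensures that the encoder extracts meaningful information from the inputs. The second term can be regarded as a regularization term, which forces the latent code, $z_V$, for all inputs to follow a standard normal distribution. Here, instead of minimizing the Kullback-Leibler divergence directly, we decompose it into three terms as in [Chen et al.2018]:

$$\mathbb{E}_{p(n)}\Big[D_{KL}\big(q(z_V \mid x_n) \,\|\, p(z_V)\big)\Big] = D_{KL}\big(q(z_V, n) \,\|\, q(z_V)\,p(n)\big) + D_{KL}\Big(q(z_V) \,\Big\|\, \prod_j q(z_{V,j})\Big) + \sum_j D_{KL}\big(q(z_{V,j}) \,\|\, p(z_{V,j})\big) \tag{8}$$

where $n$ indexes the training samples and $z_{V,j}$ denotes the $j$th dimension of the latent variable.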

The three terms in Eq. (8) are referred to as the index-code mutual information, total correlation and dimension-wise KL, respectively. Prior work [Chen et al.2018, Alessandro and Stefano2018, Burgess et al.2018] has shown that penalizing the index-code mutual information and total correlation terms leads to a more disentangled representation, while the dimension-wise KL term ensures the latent variables do not deviate too far from the prior. Similar to [Chen et al.2018], we penalize the total correlation with a weight $\beta$, and $\mathcal{L}_V$ can be rewritten as follows:

$$\mathcal{L}_V = \|x - \hat{x}\|^2 + D_{KL}\big(q(z_V, n) \,\|\, q(z_V)\,p(n)\big) + \beta\, D_{KL}\Big(q(z_V) \,\Big\|\, \prod_j q(z_{V,j})\Big) + \sum_j D_{KL}\big(q(z_{V,j}) \,\|\, p(z_{V,j})\big) \tag{9}$$

With joint supervision of $\mathcal{L}_{cls}$ and $\mathcal{L}_V$, the model is not only able to extract discriminative class-specific features $z_I$, but can also model the distribution of intra-class variance, parameterized by $\mu$ and $\sigma$.

To further increase the model’s robustness, we generate hard samples from existing easy ones to train the classifier. Specifically, we draw samples from the distribution of intra-class variance and add them to $z_I$ to construct synthesized embedding features. The synthesized features are then taken as the inputs of the final classification layer and produce the following cross-entropy loss:

$$\mathcal{L}_{aug} = -\log \frac{\exp\big(f_c(\hat{z})\big)}{\sum_{j=1}^{C} \exp\big(f_j(\hat{z})\big)} \tag{10}$$

where $\hat{z} = z_I + \tilde{z}_V$, and $\tilde{z}_V$ is a sampled intra-class variance feature. Compared with $z_I$, $\hat{z}$ contains a larger range of intra-class variance and thus is harder for $f$ to classify correctly. Such exposure to hard samples forces the feature extractor to output discriminative features.
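The effect of training on synthesized hard samples can be illustrated with a small NumPy sketch. The function below averages the cross-entropy over several perturbed copies of one class-specific feature. All names, the toy identity classifier and the chosen standard deviations are assumptions for illustration, not the authors' code.

```python
import numpy as np

def log_softmax(logits):
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def augmented_cls_loss(z_i, mu, sigma, weight, bias, label, n_samples, rng):
    """Mean cross-entropy over z_i + z_v with z_v ~ N(mu, diag(sigma^2))."""
    z_v = mu + sigma * rng.standard_normal((n_samples, z_i.shape[0]))
    z_hat = z_i + z_v                          # (n_samples, dim)
    logits = z_hat @ weight.T + bias           # (n_samples, n_classes)
    return -log_softmax(logits)[:, label].mean()

rng = np.random.default_rng(0)
z_i = np.array([2.0, 0.0])                     # toy feature of class 0
weight, bias = np.eye(2), np.zeros(2)          # toy linear classifier
mu = np.zeros(2)

# sigma = 0 reproduces the loss on the unperturbed feature.
clean = augmented_cls_loss(z_i, mu, np.zeros(2), weight, bias, 0, 1, rng)
# sigma = 1 spreads the samples and makes them harder to classify.
hard = augmented_cls_loss(z_i, mu, np.ones(2), weight, bias, 0, 5000, rng)
```

In this toy setting the averaged loss on the perturbed features is larger than the loss on the clean feature, which matches the intuition that the synthesized features act as harder training examples.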

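The overall loss function in the training stage is a weighted combination of the aforementioned terms:

$$\mathcal{L} = \mathcal{L}_{cls} + \mathcal{L}_{aug} + \lambda\, \mathcal{L}_V \tag{11}$$

where $\lambda$ is the coefficient of $\mathcal{L}_V$.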

Transfer Learning of Intra-class Variance

With the intra-class variance learned on the base set, we now illustrate the way we transfer it to the novel set images for feature augmentation.

Given an image $x$ of novel class $c$, we can not only extract the class-specific feature $z_I$ but also obtain the distribution of intra-class variance, parameterized by $\mu$ and $\sigma$. For traditional few-shot learning methods, a new classifier $f'$ is trained on the class-specific features of all novel images, typically one or five images for each category, by minimizing the softmax cross-entropy loss:

$$\mathcal{L}_{novel} = -\log \frac{\exp\big(f'_c(z_I)\big)}{\sum_{j=1}^{C_n} \exp\big(f'_j(z_I)\big)} \tag{12}$$

where $c$ is the label of $x$ and $C_n$ is the number of novel classes.

| Method | CUB 1-shot | CUB 5-shot | NAB 1-shot | NAB 5-shot | Stanford Dogs 1-shot | Stanford Dogs 5-shot |
|---|---|---|---|---|---|---|
| Baseline [Chen et al.2019] | 63.90 ± 0.88 | 82.54 ± 0.54 | 70.36 ± 0.89 | 87.91 ± 0.49 | 63.53 ± 0.89 | 79.95 ± 0.59 |
| Baseline++ [Chen et al.2019] | 68.46 ± 0.85 | 81.02 ± 0.46 | 76.00 ± 0.85 | 90.99 ± 0.41 | 58.30 ± 0.35 | 73.77 ± 0.68 |
| MAML [Finn, Abbeel, and Levine2017] | 71.11 ± 1.00 | 82.08 ± 0.72 | 80.08 ± 0.93 | 88.87 ± 0.54 | 66.56 ± 0.66 | 79.32 ± 0.35 |
| MatchingNet [Vinyals et al.2016] | 72.62 ± 0.90 | 84.14 ± 0.50 | 73.91 ± 0.72 | 88.17 ± 0.45 | 65.87 ± 0.81 | 80.70 ± 0.42 |
| ProtoNet [Snell, Swersky, and Zemel2017] | 71.57 ± 0.89 | 86.37 ± 0.49 | 73.60 ± 0.83 | 89.72 ± 0.41 | 65.02 ± 0.92 | 83.69 ± 0.48 |
| RelationNet [Sung et al.2018] | 70.20 ± 0.84 | 84.28 ± 0.46 | 67.41 ± 0.82 | 85.47 ± 0.43 | 59.38 ± 0.79 | 79.10 ± 0.37 |
| MTL [Sun et al.2018] | 73.31 ± 0.92 | 82.29 ± 0.51 | 78.69 ± 0.78 | 87.74 ± 0.34 | 54.96 ± 1.03 | 68.76 ± 0.65 |
| Delta-encoder [Schwartz et al.2018] | 73.91 ± 0.87 | 85.60 ± 0.62 | 79.42 ± 0.77 | 92.32 ± 0.59 | 68.59 ± 0.53 | 78.60 ± 0.78 |
| MetaOptNet [Lee et al.2019] | 75.15 ± 0.46 | 87.09 ± 0.30 | 84.56 ± 0.46 | 93.31 ± 0.22 | 65.48 ± 0.49 | 79.39 ± 0.25 |
| Ours | **81.31 ± 0.83** | **91.48 ± 0.39** | **88.62 ± 0.73** | **95.22 ± 0.32** | **76.24 ± 0.87** | **88.00 ± 0.47** |

Table 1: Few-shot classification accuracy on the CUB [Welinder et al.2010], NAB [Van Horn et al.2015] and Stanford Dogs [Khosla et al.2011] datasets. All experiments are from 5-way classification with a ResNet12 backbone. The best performance is indicated in bold.

Due to the lack of samples, there is a strong probability that the class-specific features are not representative of the class. In such a case, the classifier tends to be biased towards certain category-irrelevant factors such as viewpoints while ignoring the parts that matter for classification. To address this issue, we generate additional features by adding to the class-specific features a bias term sampled from the distribution of intra-class variance:

$$\hat{z} = z_I + \tilde{z}_V \tag{13}$$

where $\hat{z}$ is the augmented feature and $\tilde{z}_V$ is sampled from the posterior distribution $q(z_V \mid x)$.

By sampling from $q(z_V \mid x)$ multiple times, we get multiple augmented features $\hat{z}$. Since the posterior distribution is learned from abundant base set images, the larger base set intra-class variance can be transferred to the novel set. Training with $z_I$ and $\hat{z}$ jointly thus leads to a more robust classifier.
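The fine-tuning procedure described above can be sketched as follows: augment each support feature several times, then train a linear classifier on the original and augmented features together. This is a simplified NumPy illustration under stated assumptions: the arrays `z_i`, `mu` and `sigma` are stand-ins for the outputs of the trained feature extractor and encoder, the classifier is a plain softmax regression, and full-batch gradient descent replaces mini-batch training.

```python
import numpy as np

def augment_support(z_i, mu, sigma, n_aug, rng):
    """Return the original features followed by n_aug augmented copies each.

    z_i, mu, sigma: arrays of shape (n_support, dim).
    """
    eps = rng.standard_normal((n_aug,) + z_i.shape)
    z_hat = z_i[None] + mu[None] + sigma[None] * eps   # (n_aug, n_support, dim)
    return np.concatenate([z_i, z_hat.reshape(-1, z_i.shape[1])], axis=0)

def train_linear_classifier(feats, labels, n_classes, lr=0.5, n_iters=100):
    """Softmax regression trained by full-batch gradient descent."""
    w = np.zeros((n_classes, feats.shape[1]))
    b = np.zeros(n_classes)
    onehot = np.eye(n_classes)[labels]
    for _ in range(n_iters):
        logits = feats @ w.T + b
        logits -= logits.max(axis=1, keepdims=True)
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - onehot) / len(feats)           # d(loss)/d(logits)
        w -= lr * grad.T @ feats
        b -= lr * grad.sum(axis=0)
    return w, b

rng = np.random.default_rng(0)
n_way, dim, n_aug = 5, 16, 5
z_i = 3.0 * np.eye(n_way, dim)            # stand-in for extracted features
mu = np.zeros((n_way, dim))               # stand-in for encoder means
sigma = np.full((n_way, dim), 0.1)        # stand-in for encoder std devs
labels = np.arange(n_way)

feats = augment_support(z_i, mu, sigma, n_aug, rng)
# Augmented copies are stacked round by round, so labels are tiled to match.
all_labels = np.concatenate([labels, np.tile(labels, n_aug)])
w, b = train_linear_classifier(feats, all_labels, n_way)
preds = (z_i @ w.T + b).argmax(axis=1)
```

In a 5-way 1-shot task with five augmented features per support image, the classifier sees 30 training features instead of 5.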

Another approach to augment feature representations is to add random noise to $z_I$. However, the variations introduced by noise might not be aligned with the true distribution of intra-class variance found in the base categories. Such augmented features could be meaningless and even have a negative impact on the performance of the classifier.

| Method | CUB 1-shot | CUB 5-shot | Stanford Dogs 1-shot | Stanford Dogs 5-shot |
|---|---|---|---|---|
| MatchingNet [Vinyals et al.2016] | 45.30 ± 1.03 | 59.50 ± 1.01 | 35.80 ± 0.99 | 47.50 ± 1.03 |
| ProtoNet [Snell, Swersky, and Zemel2017] | 37.36 ± 1.00 | 45.28 ± 1.03 | 37.59 ± 1.00 | 48.19 ± 1.03 |
| RelationNet [Sung et al.2018] | 58.99 ± 0.52 | 71.20 ± 0.40 | 43.29 ± 0.46 | 55.15 ± 0.39 |
| MAML [Finn, Abbeel, and Levine2017] | 58.13 ± 0.36 | 71.51 ± 0.30 | 44.84 ± 0.31 | 58.61 ± 0.30 |
| adaCNN [Munkhdalai et al.2018] | 56.76 ± 0.50 | 61.05 ± 0.44 | 42.16 ± 0.43 | 54.12 ± 0.39 |
| CovaMNet [Luo et al.2019] | 52.42 ± 0.76 | 63.76 ± 0.64 | 49.10 ± 0.76 | 63.04 ± 0.65 |
| DN4 [Li et al.2019] | 53.15 ± 0.84 | 81.90 ± 0.60 | 45.73 ± 0.76 | 61.51 ± 0.85 |
| LRPABN [Huang et al.2019] | 63.63 ± 0.77 | 76.06 ± 0.58 | 45.72 ± 0.75 | 60.94 ± 0.66 |
| MattML [Zhu, Liu, and Jiang2020] | 66.29 ± 0.56 | 80.34 ± 0.30 | 54.84 ± 0.53 | 71.34 ± 0.38 |
| Ours | **68.42 ± 0.92** | **82.42 ± 0.61** | **57.03 ± 0.86** | **73.00 ± 0.66** |

Table 2: Few-shot classification accuracy on the CUB [Welinder et al.2010] and Stanford Dogs [Khosla et al.2011] datasets. All experiments are from 5-way classification with a Conv4 backbone. The best performance is indicated in bold.



We evaluate our method on three fine-grained image classification datasets: Caltech UCSD Birds (CUB) [Welinder et al.2010], North America Birds (NAB) [Van Horn et al.2015] and Stanford Dogs [Khosla et al.2011]. The CUB dataset contains 11,788 bird images. There are 200 bird species in total and the number of images per class is about 60. Following the setup introduced in [Welinder et al.2010], we sample the base classes from the 100 classes provided for training, and sample the novel set from the 50 classes provided for testing. The NAB dataset contains 48,527 bird images with 555 classes, which is four times larger than CUB. Similar to [Tsutsui, Fu, and Crandall2019], we adopt a 2:1:1 training, validation and test set split. The Stanford Dogs dataset is a subset of the ImageNet dataset designed for fine-grained image classification, where 60, 30 and 30 categories are used for training, validation and testing, respectively.

Implementation Details

The architecture of our feature extractor has two options, ResNet12 and Conv4. ResNet12 [He et al.2016] contains 4 residual blocks. Each residual block is composed of 3 convolutional layers with 3×3 kernels. A 2×2 max-pooling layer is applied at the end of each residual block. The dimensionality of the output feature map is . The class-specific features are calculated by average-pooling the output of the ResNet12. The encoder consists of three convolutional blocks followed by two fully-connected heads that output $\mu$ and $\sigma$, respectively. The decoder consists of a fully connected layer followed by three convolutional blocks. In addition, we also adopt the widely used Conv4 [Vinyals et al.2016, Snell, Swersky, and Zemel2017] as the feature extractor for fair comparison with other methods. Conv4 consists of 4 layers, each with 3×3 convolutions and 32 filters, followed by batch normalization (BN), a ReLU nonlinearity, and 2×2 max-pooling. The dimensionality of the output feature map is . The flattened feature map is passed through a fully connected layer to obtain the class-specific feature, which is 640-dimensional. The structures of the encoder and the decoder for the Conv4 backbone are the same as those for the ResNet12 backbone.

The whole network is trained from scratch in an end-to-end manner. In the training stage, we use the Adam optimizer [Kingma and Ba2015] on all datasets with an initial learning rate of 0.001. We train for 100 epochs in total with a batch size of 16 and decay the learning rate by a factor of 0.1 at epochs 40 and 80. For the weights in the learning objective, we set $\beta = 4$ in Eq. (9) and $\lambda = 1$ in Eq. (11). In the fine-tuning stage, we select 5 classes from the novel classes randomly. For each class, we pick $K$ instances as the support set and 16 instances as the query set for a $K$-shot task. The extracted features of all support set images along with the augmented features are used to train a linear classifier for 100 iterations with a batch size of 4. For each extracted feature of a support set image, we obtain five augmented features. The final results are averaged over 600 experiments. For data augmentation, we adopt random crop, horizontal flip and color jittering as in [Chen et al.2019]. The final size of the input images is 84×84.
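Two pieces of the protocol above, the step learning-rate schedule and the sampling of a 5-way evaluation episode, can be sketched with the Python standard library alone. The function names, the zero-based epoch counting, and the toy dictionary of image identifiers are assumptions for illustration.

```python
import random

def step_lr(epoch, base_lr=0.001, milestones=(40, 80), gamma=0.1):
    """Learning rate after multiplying by gamma at each milestone reached.

    Assumes zero-based epochs, so the first decay applies from epoch 40 on.
    """
    n_decays = sum(epoch >= m for m in milestones)
    return base_lr * gamma ** n_decays

def sample_episode(images_by_class, n_way=5, k_shot=1, n_query=16, rng=random):
    """Pick n_way classes, then k_shot support and n_query query items each."""
    classes = rng.sample(sorted(images_by_class), n_way)
    support, query = [], []
    for label, cls in enumerate(classes):
        picked = rng.sample(images_by_class[cls], k_shot + n_query)
        support += [(img, label) for img in picked[:k_shot]]
        query += [(img, label) for img in picked[k_shot:]]
    return support, query

# Hypothetical novel set: 50 classes with 60 image identifiers each.
novel = {f"class_{c}": [f"img_{c}_{i}" for i in range(60)] for c in range(50)}
support, query = sample_episode(novel, n_way=5, k_shot=5, rng=random.Random(0))
```

Drawing support and query items from a single sample without replacement guarantees that no image appears in both sets of the same episode.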

Results and Analysis

Table 1 summarizes the 5-way classification accuracy of different methods with a ResNet12 backbone. The results are obtained by running the corresponding public code. Our proposed method improves on the previous methods by a large margin under both 1-shot and 5-shot settings on all three datasets. Compared with Delta-encoder [Schwartz et al.2018], another data augmentation based method, our method gains 7.40, 9.20 and 7.65 percentage points in the 1-shot setting and 5.88, 2.90 and 9.40 percentage points in the 5-shot setting on CUB, NAB and Stanford Dogs, respectively. On CUB and NAB, the improvement in the 1-shot setting is larger than in the 5-shot setting. Since the 1-shot setting is a more extreme case of data starvation, augmenting training data tends to be more effective there.
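The gains over Delta-encoder quoted above follow directly from the Table 1 accuracies. The short check below recomputes them; the dictionary layout is just a convenience for this example.

```python
# Accuracies (%) copied from Table 1, ordered as (CUB, NAB, Stanford Dogs).
ours = {"1-shot": (81.31, 88.62, 76.24), "5-shot": (91.48, 95.22, 88.00)}
delta_encoder = {"1-shot": (73.91, 79.42, 68.59), "5-shot": (85.60, 92.32, 78.60)}

# Gain in percentage points for each setting and dataset.
gains = {
    shot: tuple(round(o - d, 2) for o, d in zip(ours[shot], delta_encoder[shot]))
    for shot in ours
}
```

The 1-shot gain exceeds the 5-shot gain on CUB and NAB, while on Stanford Dogs the 5-shot gain is the larger of the two.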

Table 2 presents the 5-way accuracy on the CUB and Stanford Dogs datasets using a Conv4 backbone. In addition to well-known few-shot learning methods [Finn, Abbeel, and Levine2017, Vinyals et al.2016, Sung et al.2018, Snell, Swersky, and Zemel2017], we also compare our method with recently proposed state-of-the-art methods [Luo et al.2019, Li et al.2019, Huang et al.2019, Zhu, Liu, and Jiang2020]. Note that we do not show NAB results since some comparison methods do not have public code for re-implementation. Here too, our proposed method achieves state-of-the-art performance under both 1-shot and 5-shot settings. In the 1-shot setting in particular, our method obtains a gain of 2.13 percentage points on CUB and 2.19 percentage points on Stanford Dogs over MattML, a recently proposed method that also specifically targets fine-grained few-shot visual recognition.

Figure 3: Visualization of features on the CUB dataset using t-SNE. (a) Original class-specific features. (b) Augmented class-specific features. (c) Intra-class variance features. Different colors indicate different categories.

| Method | KNN 1-shot | KNN 5-shot | SVM 1-shot | SVM 5-shot | LR 1-shot | LR 5-shot |
|---|---|---|---|---|---|---|
| MetaIRNet [Tsutsui, Fu, and Crandall2019] | 63.18 | 74.82 | 63.76 | 76.77 | 63.53 | 79.95 |
| Delta-Encoder [Schwartz et al.2018] | 58.23 | 82.67 | 76.02 | 82.87 | 76.22 | 85.17 |
| Ours | **75.26** | **83.17** | **79.07** | **87.59** | **78.34** | **89.30** |

Table 3: Few-shot classification accuracy on the CUB [Welinder et al.2010] dataset in 1-shot and 5-shot settings with different types of classifiers. The best performance is indicated in bold.

Ablation Studies

Increasing the number of augmented features

We have observed that synthesizing samples with our proposed method brings a significant performance boost compared to the baseline of using just the few provided samples. But are we actually generating meaningful samples aligned with the actual distribution of real images in the feature space? To validate the effectiveness of our augmented features, we evaluate the few-shot performance as the number of augmented features grows. We provide 1-shot accuracy on CUB and NAB with different numbers of augmented samples, ranging from 0 to 5 per class. Moreover, we compare the results of our method with generating additional features simply with Gaussian noise.

Figure 4: Few-shot accuracy with different numbers of augmented features on the CUB and NAB datasets. Zero augmented features indicates that the classifier is trained only with the original features extracted from the support set images, one for each class. $n$ augmented features are generated by adding to the original features a bias term sampled $n$ times from the distribution of either intra-class variance or Gaussian noise.

As shown in Figure 4, few-shot recognition performance generally keeps improving as the number of augmented samples increases. For the CUB dataset, the best accuracy is achieved with 4 augmented samples per class, i.e., 20 augmented samples in total. For the NAB dataset, 5 augmented samples per class give the best accuracy, achieving 88.62%. The improvement with an increasing number of augmented features suggests that the proposed variational inference approach learns meaningful intra-class variance effectively.

Our method also consistently outperforms augmenting features with Gaussian noise, which demonstrates that the learnt intra-class variance is not akin to simple augmentation. It is also worth noting that generating more features augmented with noise does not bring performance improvement. Although adding noise can enlarge intra-class variance in theory, the variation introduced by simple noise distributions cannot reflect the actual distribution of intra-class variance in real images.

Comparison to other data augmentation based methods

We compare our method with two other data augmentation based few-shot learning methods: MetaIRNet [Tsutsui, Fu, and Crandall2019] and Delta-Encoder [Schwartz et al.2018]. MetaIRNet uses a pretrained image generator to synthesize additional images, which are then combined with the original images so that the resulting ‘hybrid’ training images improve one-shot learning. Delta-Encoder learns to synthesize transferable non-linear deformations between pairs of examples of seen classes and applies these deformations to the few provided samples of novel categories. Here we use the additional samples synthesized by each of these methods to train three types of classifiers, i.e., K-nearest neighbors (KNN), Support Vector Machine (SVM) and Logistic Regression (LR), which are then used to classify novel images.
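To make the comparison protocol concrete, the sketch below shows the simplest of the three classifier types, a K-nearest-neighbors classifier, applied to a pool of original and synthesized features. It is a NumPy illustration only: the function name, the value of `k`, and the toy 2-D features are assumptions, since the text does not specify these details.

```python
import numpy as np

def knn_predict(train_feats, train_labels, query_feats, k=1):
    """Majority vote among the k nearest training features (Euclidean)."""
    diffs = query_feats[:, None, :] - train_feats[None, :, :]
    sq_dists = (diffs ** 2).sum(axis=-1)             # (n_query, n_train)
    nearest = np.argsort(sq_dists, axis=1)[:, :k]    # indices of k neighbours
    votes = train_labels[nearest]                    # (n_query, k)
    n_classes = train_labels.max() + 1
    return np.array([np.bincount(v, minlength=n_classes).argmax() for v in votes])

# Toy pool: one original and two synthesized features per class (made up).
train_feats = np.array([[0.0, 0.0], [0.1, 0.2], [-0.2, 0.1],
                        [2.0, 2.0], [2.1, 1.8], [1.9, 2.2]])
train_labels = np.array([0, 0, 0, 1, 1, 1])
query = np.array([[0.3, 0.1], [1.7, 2.1]])
preds = knn_predict(train_feats, train_labels, query, k=3)
```

Because KNN has no trainable parameters, its accuracy depends entirely on how well the augmented pool covers each class, which makes it a direct probe of the quality of the synthesized features.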

Comparisons between these methods and our method are shown in Table 3. The superior performance of our method demonstrates that the augmented features obtained by our framework are beneficial for various types of classifiers. Note that for MetaIRNet [Tsutsui, Fu, and Crandall2019], the results in Table 3 are lower than those reported by its authors, since they pre-train the backbone on ImageNet while we do not, for fair comparison.

Visualization of enlarged intra-class variance

To visualize the class-specific features as well as the intra-class variance, we plot them in 2D using t-SNE (see Figure 3). As can be seen from the figure, the original class-specific features are discriminative (Figure 3(a)). The augmented samples exhibit larger intra-class variance than the original ones, which will lead to a more robust classifier (Figure 3(b)). The intra-class variance (Figure 3(c)) is distributed similarly across different categories, which validates our assumption that it can be transferred from the base set to the novel set for feature augmentation.


We have proposed an effective feature generation method via variational transfer learning to address the data scarcity problem in few-shot fine-grained classification. The generated features enlarge the intra-class variance for novel set images while preserving the class-specific attributes. The consistent performance improvement with the increase of the number of augmented samples suggests that the learned features are meaningful and nontrivial. The higher accuracy compared with other data augmentation based methods further demonstrates the superiority of our method. While this work mainly focuses on few-shot recognition problems, a promising future direction is to apply the feature transfer idea to other data-starved or label-starved tasks.


  • [Alessandro and Stefano2018] Alessandro, A., and Stefano, S. 2018. Emergence of invariance and disentanglement in deep representations. In J. Mach. Learn. Res.
  • [Antreas Antoniou2018] Antreas Antoniou, Amos Storkey, H. E. 2018. Data augmentation generative adversarial networks. In arXiv preprint arXiv:1711.04340.
  • [Burgess et al.2018] Burgess, C. P.; Higgins, I.; Pal, A.; Matthey, L.; Watters, N.; Desjardins, G.; and Lerchner, A. 2018. Understanding disentangling in β-VAE. In arXiv: Machine Learning.

  • [Chen and Liu2013] Chen, J., and Liu, X. 2013. Transfer learning with one-class data. In Pattern Recognition Letters.
  • [Chen et al.2018] Chen, R. T. Q.; Li, X.; Grosse, R.; and Duvenaud, D. 2018. Isolating sources of disentanglement in variational autoencoders. In arXiv preprint arXiv:1802.04942.
  • [Chen et al.2019] Chen, W.-Y.; Liu, Y.-C.; Kira, Z.; Wang, Y.-C. F.; and Huang, J.-B. 2019. A closer look at few-shot classification. In International Conference on Machine Learning (ICML).
  • [Finn, Abbeel, and Levine2017] Finn, C.; Abbeel, P.; and Levine, S. 2017. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning (ICML).
  • [Hariharan and Girshick2017] Hariharan, B., and Girshick, R. 2017. Low-shot visual recognition by shrinking and hallucinating features. In IEEE International Conference on Computer Vision (ICCV).

  • [He et al.2016] He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [Higgins et al.2017] Higgins, I.; Matthey, L.; Pal, A.; Burgess, C.; Glorot, X.; Botvinick, M.; Mohamed, S.; and Lerchner, A. 2017. beta-vae: Learning basic visual concepts with a constrained variational framework. In Proceedings of the International Conference on Learning Representations (ICLR).
  • [Huang et al.2019] Huang, H.; Zhang, J.; Zhang, J.; Xu, J.; and Wu, Q. 2019. Low-rank pairwise alignment bilinear network for few-shot fine-grained image classification. In arXiv preprint arXiv:1908.01313.
  • [Khosla et al.2011] Khosla, A.; Jayadevaprakash, N.; Yao, B.; and Fei-Fei, L. 2011. Novel dataset for fine-grained image categorization. In First Workshop on Fine-Grained Visual Categorization (FGVC), IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [Kim et al.2019] Kim, J.; Oh, T.-H.; Lee, S.; Pan, F.; and Kweon, I. S. 2019. Variational prototyping-encoder: One-shot learning with prototypical images. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [Kingma and Ba2015] Kingma, D. P., and Ba, J. 2015. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations (ICLR).
  • [Kingma and Welling2014] Kingma, D. P., and Welling, M. 2014. Auto-encoding variational bayes. In Proceedings of the International Conference on Learning Representations (ICLR).
  • [Lee et al.2019] Lee, K.; Maji, S.; Ravichandran, A.; and Soatto, S. 2019. Meta-learning with differentiable convex optimization. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [Li et al.2017] Li, Z.; Zhou, F.; Chen, F.; and Li, H. 2017. Meta-sgd: Learning to learn quickly for few-shot learning. In arXiv preprint arXiv:1707.09835.
  • [Li et al.2019] Li, W.; Wan, L.; Xu, J.; Huo, J.; Gao, Y.; and Luo, J. 2019. Revisiting local descriptor based image-to-class measure for few-shot learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [Lin et al.2018] Lin, X.; Duan, Y.; Dong, Q.; Lu, J.; and Zhou, J. 2018. Deep variational metric learning. In European Conference on Computer Vision (ECCV).
  • [Luo et al.2019] Luo, W.; Yang, X.; Mo, X.; Lu, Y.; Davis, L. S.; Li, J.; Yang, J.; and Lim, S.-N. 2019. Cross-x learning for fine-grained visual categorization. In IEEE International Conference on Computer Vision (ICCV).
  • [Munkhdalai et al.2018] Munkhdalai, T.; Yuan, X.; Mehri, S.; and Trischler, A. 2018. Rapid adaptation with conditionally shifted neurons. In International Conference on Machine Learning (ICML).
  • [Pan and Yang2009] Pan, S. J., and Yang, Q. 2009. A survey on transfer learning. In IEEE Transactions on Knowledge and Data Engineering (TKDE).
  • [Schonfeld et al.2019] Schonfeld, E.; Ebrahimi, S.; Sinha, S.; Darrell, T.; and Akata, Z. 2019. Generalized zero- and few-shot learning via aligned variational autoencoders. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [Schwartz et al.2018] Schwartz, E.; Karlinsky, L.; Shtok, J.; Harary, S.; Marder, M.; Feris, R.; Kumar, A.; Giryes, R.; and Bronstein, A. M. 2018. Delta-encoder: an effective sample synthesis method for few-shot object recognition. In Advances in Neural Information Processing Systems (NeurIPS).
  • [Snell, Swersky, and Zemel2017] Snell, J.; Swersky, K.; and Zemel, R. S. 2017. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems (NeurIPS).
  • [Sun et al.2018] Sun, Q.; Liu, Y.; Chua, T.-S.; and Schiele, B. 2018. Meta-transfer learning for few-shot learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [Sung et al.2018] Sung, F.; Yang, Y.; Zhang, L.; Xiang, T.; Torr, P. H. S.; and Hospedales, T. M. 2018. Learning to compare: Relation network for few-shot learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [Tsutsui, Fu, and Crandall2019] Tsutsui, S.; Fu, Y.; and Crandall, D. 2019. Meta-Reinforced Synthetic Data for One-Shot Fine-Grained Visual Recognition. In Advances in Neural Information Processing Systems (NeurIPS).
  • [Van Horn et al.2015] Van Horn, G.; Branson, S.; Farrell, R.; Haber, S.; Barry, J.; Ipeirotis, P.; Perona, P.; and Belongie, S. 2015. Building a bird recognition app and large scale dataset with citizen scientists: The fine print in fine-grained dataset collection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [Vinyals et al.2016] Vinyals, O.; Blundell, C.; Lillicrap, T.; Kavukcuoglu, K.; and Wierstra, D. 2016. Matching networks for one shot learning. In Advances in Neural Information Processing Systems (NeurIPS).
  • [Wang et al.2018] Wang, Y.-X.; Girshick, R.; Hebert, M.; and Hariharan, B. 2018. Low-shot learning from imaginary data. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [Welinder et al.2010] Welinder, P.; Branson, S.; Mita, T.; Wah, C.; Schroff, F.; Belongie, S.; and Perona, P. 2010. Caltech-UCSD Birds 200. Technical Report CNS-TR-2010-001, California Institute of Technology.
  • [Yin et al.2019] Yin, X.; Yu, X.; Sohn, K.; Liu, X.; and Chandraker, M. 2019. Feature transfer learning for deep face recognition with under-represented data. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [Zhang et al.2019] Zhang, J.; Zhao, C.; Ni, B.; Xu, M.; and Yang, X. 2019. Variational few-shot learning. In The IEEE International Conference on Computer Vision (ICCV).
  • [Zhu, Liu, and Jiang2020] Zhu, Y.; Liu, C.; and Jiang, S. 2020. Multi-attention meta learning for few-shot fine-grained image recognition. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence (IJCAI).