Diversity Transfer Network for Few-Shot Learning (AAAI-20, oral presentation)
Few-shot learning is a challenging task that aims at training a classifier for unseen classes with only a few training examples. The main difficulty of few-shot learning lies in the lack of intra-class diversity within the insufficient training samples. To alleviate this problem, we propose a novel generative framework, the Diversity Transfer Network (DTN), which learns to transfer latent diversities from known categories and composite them with support features to generate diverse samples for novel categories in feature space. The learning problem of sample generation (i.e., diversity transfer) is solved by minimizing an effective meta-classification loss in a single-stage network, instead of the generative loss used in previous works. Besides, an organized auxiliary task co-training over known categories is proposed to stabilize the meta-training process of DTN. We perform extensive experiments and ablation studies on three datasets, i.e., miniImageNet, CIFAR100 and CUB. The results show that DTN, with single-stage training and faster convergence, obtains state-of-the-art results among feature generation based few-shot learning methods. Code and supplementary material are available at: https://github.com/Yuxin-CV/DTN
Deep neural networks (DNNs) have shown tremendous success in solving many challenging real-world problems when a large amount of training data is available [10, 24, 8]. Common practice suggests that models with more parameters have greater capacity to fit data, and more training data usually provides better generalization. However, DNNs struggle to generalize from only a few training examples, while humans excel at learning new concepts from just a few examples. Few-shot learning has therefore been proposed to close the performance gap between machine learners and human learners. In the canonical setting of few-shot learning, there are a training set (seen, known) and a testing set (unseen, novel) with disjoint categories. Models are trained on the training set and tested in an N-way K-shot scheme, where they need to classify queries into one of the N categories correctly when only K samples of each novel category are given. This unique setting poses an unprecedented challenge in fully utilizing the prior information in the training set, which corresponds to the known or historical information of a human learner. Common approaches to this challenge either learn a good metric for novel tasks [25, 27, 26] or train a meta-learner for fast adaptation [5, 16, 21].
Recently, generation based approaches have become an effective solution for few-shot learning [7, 23, 29, 32], since they directly alleviate the problem of lacking training samples. We propose a Diversity Transfer Network (DTN) for sample generation. In DTN, the offset between a random sample pair from a known category is composited with a support sample of a novel category in the latent feature space. Then, the generated features, as well as the support features, are averaged as the proxy of the novel category. Finally, query samples are evaluated against the proxy. Only if the generated samples are diverse and follow the distribution of the real samples can the meta-classifier (i.e., the proxy) be robust enough to classify queries correctly.
In addition to the new sample generation scheme, we utilize an effective meta-training curriculum called OAT (Organized Auxiliary task co-Training), inspired by the auxiliary task co-training in TADAM and by curriculum learning. OAT organizes auxiliary tasks and meta-tasks reasonably and effectively reduces training complexity. Experiments show that with OAT, our DTN converges much faster than with the naïve meta-training strategy (i.e., meta-training from scratch), the multi-stage training strategy used in the Δ-encoder, and the auxiliary task co-training strategy used in TADAM.
The main components of DTN are integrated into a single network and can be optimized in an end-to-end fashion. Thus, DTN is very simple to implement and easy to train. Our experimental results show that this simple method outperforms many previous works on a variety of datasets.
Metric learning is the most common and straightforward solution for few-shot learning. An embedding function can be learned from a myriad of instances of known categories. Then simple metrics, such as Euclidean distance and cosine distance [27, 20, 19], are used to build nearest-neighbor classifiers for instances of unseen categories. Furthermore, to model the contextual information among support images and query images, a bidirectional LSTM and an attention mechanism are adopted in Matching Network. Besides measuring the distances from a query to its support images, another solution compares the query to the center of the support images of each class in feature space, as in snell2017prototypical, Qiao_2018_CVPR, Qi_2018_CVPR and Gidaris_2018_CVPR. The center is usually termed a proxy of the class. Specifically, squared Euclidean distance is used in Prototypical Network, while cosine distance is used in the other works. snell2017prototypical and Qi_2018_CVPR directly calculate proxies by averaging the embedding features, while Qiao_2018_CVPR and Gidaris_2018_CVPR use a small network to predict proxies. Based on Prototypical Network, TADAM further proposes a dynamic task-conditioned feature extractor by predicting layer-level element-wise scale and shift vectors for each convolutional layer. Different from simple metrics, Relation Network takes a neural network as a non-linear metric and directly predicts the similarities between the query and support images. TPN performs transductive learning on a similarity graph that contains both query and support images to obtain high-order similarities.
Meta-learning approaches have been widely used in few-shot learning scenarios by learning to optimize for fast adaptation, aiming at a meta-learner that can solve novel tasks quickly. Meta Network and adaResNet are memory-based methods. In Meta Network, example-level and task-level information are preserved in fast and slow weights, respectively. AdaResNet performs rapid adaptation through conditionally shifted neurons, which modify activation values with task-specific shifts retrieved from a memory module. An LSTM-based update rule for the parameters of a classifier is proposed in ravi2017, where both short-term knowledge within a task and long-term knowledge common to all tasks are learned. MAML, LEO and MT-net all differentiate through gradient update steps to optimize performance after fine-tuning. While MAML operates directly in a high-dimensional parameter space, LEO performs meta-learning within a low-dimensional latent space. Different from MAML, which assumes a fixed model, MT-net chooses a subset of its weights to fine-tune. pmlr-v80-franceschi18a propose a method based on bi-level programming that unifies gradient-based hyper-parameter optimization and meta-learning.
Sample synthesis using generative models has recently emerged as a popular direction for few-shot learning [33, 6]. How to synthesize new samples based on a few examples remains an interesting open problem. AGA and FATTEN are attribute-guided (w.r.t. pose and depth) augmentation methods in feature space that leverage a corpus with attribute annotations. Hariharan_2017_ICCV tries to transfer transformations from a pair of examples of a known category to a "seed" example of a novel class; finding specific generation targets requires a carefully designed pipeline with heuristic steps. The Δ-encoder also tries to extract intra-class deformations between image pairs sampled from the same class. Wang_2018_CVPR proposes to generate samples by adding random noise to support features. Different from previous methods, MetaGAN generates fake samples that need to be discriminated by the classifier, rather than serving as augmentation, which sharpens the decision boundaries of novel categories.
Our proposed DTN shares a philosophical similarity with the image hallucination method and the Δ-encoder, with distinct differences in the following aspects. The first difference is that DTN does not require setting specific target points for the generator. More specifically, the Δ-encoder takes a pair of images from the same class and learns to infer the diversity between them by reconstructing one image of the pair from the other. The image hallucination method collects quadruples for training based on clustering and traversal; each quadruple contains two image pairs from two classes, and a generation network is trained to predict one sample of the quadruple when the other three are given as input. The synthesized samples are then used to train a linear classifier. The input of the generator in DTN is also a triplet, as in Hariharan_2017_ICCV, but the generated sample is used directly to construct the meta-classifier, and the generator is optimized by minimizing the meta-classification loss instead of by setting specific generation targets. Secondly, DTN integrates feature extraction, feature generation and meta-learning into a single network and enjoys the simplicity and effectiveness of end-to-end training, whereas Hariharan_2017_ICCV and NIPS2018_7549 are stage-wise methods.
More recent works based on sample generation and data augmentation are IDeMe-Net and SalNet. The former utilizes an additional deformation sub-network with a large number of parameters to synthesize diverse deformed images; the latter needs to pre-train a saliency network on the MSRA-B dataset. In contrast to these approaches, our method is based on a simple diversity transfer generator that learns a better proxy for each category with fewer parameters and faster convergence. Besides, our method can be regarded as an instance of compositional learning in the latent feature space.
Different from the conventional classification task, where the training set and the testing set consist of samples from the same classes, few-shot learning addresses the setting where the label spaces of the training set and the testing set are disjoint. We follow the standard N-way K-shot classification scenario defined in vinyals2016matching. An N-way K-shot task is termed an episode. An episode is formed by first sampling N classes from the training/testing set; then K images sampled from each of the N classes constitute the support set. For simplicity, we take N-way 1-shot (i.e., K = 1) classification as the running example in the following sections, so the support set contains one sample per class. The query sample is drawn from the remaining images of the N classes. The goal is to classify the query into one of the N classes correctly, based only on the support set and the prior meta-knowledge learned from the training set.
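The episode construction above can be sketched in a few lines of Python. The names (`class_to_images`, `sample_episode`) and the default query count are illustrative, not taken from the paper's code:

```python
import random

def sample_episode(class_to_images, n_way=5, k_shot=1, n_query=15):
    """Sample one N-way K-shot episode: a support set and a query set.

    `class_to_images` maps a class label to a list of image ids.
    """
    classes = random.sample(sorted(class_to_images), n_way)  # N distinct classes
    support, query = {}, {}
    for c in classes:
        imgs = random.sample(class_to_images[c], k_shot + n_query)
        support[c] = imgs[:k_shot]   # K support samples per class
        query[c] = imgs[k_shot:]     # queries come from the remaining images
    return support, query
```

During meta-training, episodes are drawn from the training classes; during meta-testing, from the disjoint novel classes.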
The overall structure of the Diversity Transfer Network (DTN) is shown in Fig. 1. DTN contains four modules and is organized into two task branches. The task branch indicated by orange arrows is the meta-task, which is trained in a meta-learning way. The input for the meta-task consists of three parts: the support images, a query image and the reference images. All images are mapped to L2-normalized feature vectors by a feature extractor. Two reference features coming from the same category make up a reference pair. The diversity of the pair is transferred to a support feature to generate a new feature, which is supposed to belong to the same category as the support feature. Several features are generated from each support feature in this way. Since a meta-task is an N-way K-shot image classification task, the meta-classifier is an N-way classifier consisting of a weight matrix and a trainable temperature. The values in the weight matrix are determined by the proxies formed from the support features and the features generated from them. The meta-classifier is differentiable, so the feature extractor and the feature generator can be optimized end-to-end via the meta-classification loss. The other task branch is the auxiliary task, which aims to accelerate and stabilize the training of DTN. It is a conventional classification task over all categories of the training set.
Each image is mapped to a feature vector by the feature extractor; the query image, the support images and the reference image pairs are all embedded in the same feature space. For a specific support feature, during both the meta-training and the meta-testing phase, the reference image pairs are always sampled from the training set (seen, known). Specifically, we first randomly sample classes from the training classes with replacement. For each sampled class, we then randomly sample two different images to form a reference pair. We do not sample any images from the testing set (unseen, novel) during the whole process. The conventional few-shot evaluation setting, termed the N-way K-shot setting, requires building an N-way classifier with the support of only K samples for each novel class and the prior meta-knowledge from the whole training set. Therefore, our sampling method strictly complies with the few-shot learning protocol.
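The reference-pair sampling described above can be sketched as follows; the function and variable names are illustrative, not from the paper's code:

```python
import random

def sample_reference_pairs(train_pool, m):
    """Sample m reference pairs from the (seen) training classes.

    Classes are drawn with replacement; the two images in a pair come
    from the same class and must be different. `train_pool` maps a
    training class label to its list of image ids.
    """
    pairs = []
    classes = list(train_pool)
    for _ in range(m):
        c = random.choice(classes)               # sampled with replacement
        a, b = random.sample(train_pool[c], 2)   # two distinct images
        pairs.append((c, a, b))
    return pairs
```

Because `train_pool` only ever contains seen classes, no novel-class image can leak into the generation process.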
As shown in Fig. 2, the feature generator of DTN consists of two mapping functions. The three input features (one support feature and a reference pair) are first mapped into a latent space by the first mapping function. The elementwise difference between the two reference features in this latent space measures the diversity of the pair, and it is applied to the support feature by a simple linear combination. After mapping the result back by the second mapping function, we get a feature that has the same size as the input and should belong to the same category as the support feature. More specifically, the generated feature is obtained by adding the latent difference of the reference pair to the latent support feature and decoding the sum.
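A minimal numpy sketch of this generator, assuming each mapping function is a single FC layer with a leaky ReLU; the layer sizes, weights and names here are illustrative toys, not the paper's actual parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
d, h = 8, 16  # toy feature / latent dimensions (illustrative)
W1, b1 = rng.normal(size=(d, h)), np.zeros(h)  # first mapping: feature -> latent
W2, b2 = rng.normal(size=(h, d)), np.zeros(d)  # second mapping: latent -> feature

def leaky_fc(w, b, x, slope=0.2):
    # one FC layer with a leaky ReLU, standing in for a mapping function
    y = x @ w + b
    return np.maximum(slope * y, y)

def generate(support, ref_a, ref_b):
    # map all three features into the latent space
    z_s, z_a, z_b = (leaky_fc(W1, b1, f) for f in (support, ref_a, ref_b))
    z = z_s + (z_a - z_b)            # transfer the pair's diversity to the support
    g = leaky_fc(W2, b2, z)          # map back to the original feature size
    return g / np.linalg.norm(g)     # keep features L2-normalized
```

In DTN both mappings are trained jointly with the rest of the network through the meta-classification loss, rather than being fixed random matrices as in this sketch.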
Given M different reference pairs for a single support feature, there will be M generated features that enrich the diversity of the corresponding category. They help to construct a more robust classifier for unseen categories. When K > 1, each of the K support samples is taken as a "seed" and M samples are generated from it. Therefore, there will be K support samples and K × M generated samples for each novel category.
[Table 1: few-shot classification accuracies (5-way 1-shot and 5-way 5-shot) on miniImageNet, comparing Matching Network (vinyals2016matching, NeurIPS'16, 64-64-64-64 backbone), Meta-Learn LSTM (ravi2017, ICLR'17, 64-64-64-64), Prototypical Network (snell2017prototypical, NeurIPS'17, 64-64-64-64), Relation Network (sung2017learning, CVPR'18, 64-96-128-256), SalNet Intra-class Hallucination (Zhang_2019_CVPR, CVPR'19, ResNet-101) and Deep DTN (ours, ResNet-12). Among the generation based approaches, footnotes mark the use of a deformation sub-network (IDeMe-Net) and of a saliency network pre-trained on MSRA-B (SalNet).]
The meta-task branch of DTN is shown in Fig. 1, indicated by orange arrows; the orange solid arrows and dashed arrows indicate the process of meta-training and meta-testing, respectively. Each image is mapped to a feature vector. Similar to Qiao_2018_CVPR, Gidaris_2018_CVPR and Qi_2018_CVPR, all the features here are L2-normalized vectors. Each support feature and all the reference feature pairs are fed into the generator to produce M new features (see Fig. 1). We thus obtain the support features together with the generated features for each category. The meta-task is an N-way classification task, so the meta-classifier is represented by a weight matrix in which each row can be viewed as a proxy of the corresponding category. After obtaining all the features for a category, the corresponding row of the weight matrix, termed the averaged proxy, is the L2-normalized average of those features.
All the averaged proxies are L2-normalized vectors, so the meta-classifier essentially becomes a cosine-similarity based classification model. After constructing the meta-classifier, the L2-normalized query feature is fed into it for evaluation, and the prediction combines the classification scores of all categories. To further increase stability and robustness when dealing with a large number of categories, we adopt a learnable temperature τ in our meta-task loss as in Qi_2018_CVPR, where τ is updated by back-propagation during training. The meta-task loss is the cross-entropy over the temperature-scaled cosine similarities between the query feature and the averaged proxies.
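A toy numpy sketch of this proxy-based cosine classifier; the function names and the fixed temperature value are illustrative (in DTN the temperature is learned, not fixed):

```python
import numpy as np

def l2n(x):
    # L2-normalize along the last axis
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def meta_classify(query, per_class_feats, tau=10.0):
    """Classify a query feature against averaged proxies.

    `per_class_feats[i]` is an array holding the support + generated
    features of class i, one feature per row, all L2-normalized.
    Returns softmax probabilities over the N classes.
    """
    # each proxy is the L2-normalized mean of a class's features
    proxies = l2n(np.stack([feats.mean(axis=0) for feats in per_class_feats]))
    logits = tau * (proxies @ l2n(query))      # temperature-scaled cosines
    exp = np.exp(logits - logits.max())        # numerically stable softmax
    return exp / exp.sum()
```

The averaging step is what makes diverse generated features useful: they pull each proxy toward the center of the class's true distribution.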
In order to accelerate the convergence of training and obtain better generalization ability, the meta-learning network in DTN is jointly trained with an auxiliary task. The auxiliary task is a conventional classification over all categories in the training set. It shares the same feature extractor with the meta-task branch. Different from the meta-classifier, which consists of the averaged proxies, the auxiliary classifier after the feature extractor is randomly initialized and updated via back-propagation. Each mini-batch is randomly sampled from the training set. The auxiliary task loss has the same form as the meta-task loss: a cross-entropy over temperature-scaled cosine similarities between each training feature in the mini-batch and the rows of the auxiliary classifier's weight matrix, with its own learnable temperature.
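Both losses share this cross-entropy-over-scaled-cosine form, which can be sketched as follows; the names and the fixed temperature are illustrative, since in DTN the temperature is a learnable parameter:

```python
import numpy as np

def cosine_ce_loss(feat, weight_rows, label, tau=10.0):
    """Cross-entropy over temperature-scaled cosine similarities.

    `weight_rows` holds one weight vector per class (proxies for the
    meta-task, learned rows for the auxiliary task); `label` is the
    index of the true class.
    """
    w = weight_rows / np.linalg.norm(weight_rows, axis=1, keepdims=True)
    f = feat / np.linalg.norm(feat)
    logits = tau * (w @ f)                        # scaled cosine similarities
    m = logits.max()
    log_z = m + np.log(np.exp(logits - m).sum())  # stable log-partition
    return log_z - logits[label]                  # -log softmax[label]
```

Because the two branches differ only in where the weight rows come from, sharing this loss form keeps the joint training simple.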
In TADAM, the auxiliary task is sampled with a probability that is annealed exponentially. We observe some positive effects from this training strategy compared with naïve meta-training and multi-stage training in our DTN.
However, this approach has a shortcoming: the randomness in both the frequency and the order of the two tasks affects the final result to some extent, and the distribution of auxiliary tasks is unpredictable rather than annealed exponentially, especially when the number of training epochs is not very large. Another problem brought by the randomness is that it is hard to determine the training schedule, e.g., the learning rate and the number of training epochs, since the permutation of auxiliary tasks and meta-tasks varies with the random seed. We empirically find that the stochastic auxiliary task co-training strategy used in TADAM results in a large fluctuation in the meta-classification accuracy (see Table 3 for details) when using different random seeds. This randomness makes the choice of hyperparameters as well as the training schedule more difficult.
Therefore, we propose the OAT (Organized Auxiliary task co-Training) strategy, which organizes auxiliary tasks and meta-tasks in a more orderly and more reasonable manner. More specifically, there are two kinds of training epochs: auxiliary training epochs and meta-training epochs. Consecutive training epochs are grouped into training units; the i-th training unit contains a certain number of auxiliary training epochs and a certain number of meta-training epochs. Given the per-unit epoch counts for all training units, the total number of training epochs is their sum, and the whole training sequence is the concatenation of the units.
By changing the per-unit epoch counts, we can obtain training sequences with different frequencies and orders of the two tasks, which proves to be more manageable and effective than the training strategy used in TADAM. Intuitively, we would like to gradually add harder few-shot classification tasks into a series of simpler auxiliary classification tasks, so the schedule is quite simple and straightforward. We fix one schedule for training DTN, though more careful scheduling may achieve better performance; the resulting training sequence gradually shifts from auxiliary epochs to meta-training epochs.
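One way to materialize an OAT training sequence from per-unit epoch counts; the helper name and the example counts are illustrative, not the paper's actual schedule:

```python
def oat_sequence(n_aux, n_meta):
    """Expand per-unit epoch counts into the full OAT epoch sequence.

    `n_aux[i]` and `n_meta[i]` are the numbers of auxiliary ('A') and
    meta-training ('M') epochs in the i-th training unit. Within a unit
    we place auxiliary epochs first, matching the intuition of easing
    from the simpler task into the harder one.
    """
    seq = []
    for a, m in zip(n_aux, n_meta):
        seq += ['A'] * a + ['M'] * m
    return seq
```

For example, `oat_sequence([3, 2, 1], [0, 1, 2])` yields a 9-epoch sequence whose later units contain progressively more meta-training epochs.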
[Table 2: few-shot classification accuracies on CIFAR100 and CUB for Meta-Learn LSTM (ravi2017), Matching Network (vinyals2016matching) and Deep DTN (ours).]
Initially, the auxiliary tasks can be considered a simpler curriculum; later, they bring regularization effects to the meta-tasks. Ablation studies show that, compared with the training strategy used in TADAM, DTN trained by OAT obtains better and more robust results with a faster convergence speed.
Datasets. The proposed method is evaluated on multiple datasets: miniImageNet, CIFAR100 and CUB. The miniImageNet dataset has been widely used in few-shot learning since it was first proposed in vinyals2016matching. There are 64, 16 and 20 classes for training, validation and testing, respectively. The hyper-parameters are optimized on the validation set, after which the validation set is merged into the training set for the final results. The CIFAR100 dataset contains 60,000 images of 100 classes. We use 64, 16, and 20 classes for training, validation, and testing, respectively. The CUB dataset is a fine-grained dataset of 200 categories of birds. It is divided into training, validation, and testing sets with 100, 50, and 50 categories, respectively. The splits of CIFAR100 and CUB follow NIPS2018_7549.
Architectures. The feature extractor for DTN is a CNN with four convolutional modules, each containing a max-pooling layer. The structure of this feature extractor is the same as those in former methods, e.g., vinyals2016matching and snell2017prototypical, for fair comparison. Many other works use deeper networks for feature extraction to achieve better accuracy, e.g., pmlr-v80-munkhdalai18a and NIPS2018_7352. To compare with them, we also implement our algorithm with a ResNet-12 architecture. The output of the feature extractor is a fixed-length feature vector. The first mapping function in the feature generator is a fully-connected (FC) layer followed by a leaky ReLU activation and a dropout layer. The second mapping function has the same structure, except for the number of units in its FC layer.
[Table 3: training sequences and 5-way 1-shot / 5-way 5-shot results of AT and OAT under different random seeds, with the same total number of training epochs.]
[Table 4: 5-way 1-shot and 5-way 5-shot results with different feature generators: a Gaussian noise generator, the Δ-encoder, DTN with two-stage training, and DTN with OAT (ours).]
(The Δ-encoder result is our reimplementation, which outperforms the original.)
[Table 5: 5-way 1-shot and 5-way 5-shot results of DTN with different training strategies: naïve meta-training, two-stage training, AT, and OAT (ours).]
Quantitative Results. Table 1 provides comparative results on the miniImageNet dataset. All these results are reported with confidence intervals following the setting in vinyals2016matching. Under the 4-CONV feature extractor setting, our approach significantly outperforms previous state-of-the-art works, especially in the 5-way 1-shot task. As for comparisons with models using a deep feature extractor, deep DTN also surpasses the other alternatives in the 5-way 1-shot scenario and achieves very competitive results under the 5-way 5-shot setting. The results confirm that our feature generation method is especially useful for learning with scarce data, i.e., the 1-shot scenario. DTN is also one of the simplest and most lightweight feature generation methods that learn to enrich intra-class diversity, and it does not rely on any extra information from other datasets, such as the salient object information in Zhang_2019_CVPR.
Table 2 shows that DTN also achieves large improvements on the CIFAR100 and CUB datasets compared with the existing state of the art in both the 5-way 1-shot and 5-way 5-shot tasks, which confirms that DTN is generally useful across different few-shot learning scenarios.
Visualization Results. To better understand the results, Fig. 3 shows t-SNE visualizations of generated samples, support samples and real samples. It can be seen that our method greatly enriches the diversity of an unseen class when only a single or a few support examples are given. Most of the generated samples fit the distribution of real samples, which means that the category information of each support sample is well preserved by the generated samples, and they are close to the center of the real distribution even when the support sample lies on the edge. The diagrams also show that features generated from support samples can cover the major distribution of the real samples, which helps build a more robust classifier for unseen classes.
In this section, we study the impact of the feature generator, the training strategy and the number of generated features. We conduct the following ablation studies using models with deep feature extractor on miniImageNet. All the results are summarized in Table 3, Table 4, Table 5 and Table 6.
First, we compare different feature generators. For fairness, we use exactly the same meta-classifier (cosine-similarity based) and the same training strategy (two-stage training); only the feature generators differ. All models are trained until convergence. Experiments show that the diversity transfer generator outperforms both the Gaussian noise seeded generator and the Δ-encoder in the 5-way 1-shot and 5-way 5-shot settings (see Table 4).
Second, we study the effects of different training strategies. OAT surpasses both naïve meta-training and two-stage training (Table 5). As mentioned before, a large fluctuation was observed in the meta-classification accuracy when DTN is trained for 30 epochs with the auxiliary task sampled via a probability (see Table 3 for details). The result becomes better and more stable if we increase the total number of training epochs to 60, but it is still worse than the result obtained by DTN trained with OAT in only 30 epochs (Table 5). A comparison of the fluctuation of the two training strategies is detailed in Table 3.
Finally, in Table 6 we study the impact of the number of generated features. The results gradually improve as the number of generated features increases, but no further improvement is observed once the number exceeds 64. We attribute this to the fact that 64 generated features already fit the real sample distribution well.
In this work, we propose a novel generative model, the Diversity Transfer Network (DTN), for few-shot image recognition. It learns transferable diversity from known categories and augments unseen categories via sample generation. DTN achieves competitive performance on three benchmarks. We believe that the proposed generative method can be applied to various problems challenged by scarce supervision, e.g., semi-supervised learning, active learning and imitation learning. We will explore these interesting research directions in the future.
This work was supported by National Natural Science Foundation of China (NSFC) (No. 61733007, No. 61572207 and No. 61876212), National Key R&D Program of China (No. 2018YFB1402600) and HUST-Horizon Computer Vision Research Center.