MCRNet
Complementing Representation Deficiency in Few-shot Image Classification: A Meta-Learning Approach
Few-shot learning is a challenging problem that has attracted increasing attention because abundant training samples are difficult to obtain in practical applications. Meta-learning has been proposed to address this issue; it focuses on quickly adapting a predictor, the base-learner, to new tasks given limited labeled samples. However, a critical challenge for meta-learning is representation deficiency: it is hard to discover common information from a small number of training samples, or even one, and equally hard to represent key features from so little information. As a result, a meta-learner cannot be trained well in a high-dimensional parameter space to generalize to new tasks. Existing methods mostly resort to extracting less expressive features to avoid the representation deficiency. Aiming to learn better representations, we propose a meta-learning approach with complemented representations network (MCRNet) for few-shot image classification. In particular, we embed a latent space in which latent codes are reconstructed with extra representation information to complement the representation deficiency. Furthermore, the latent space is established with variational inference, collaborates well with different base-learners, and can be extended to other models. Finally, our end-to-end framework achieves state-of-the-art performance in image classification on three standard few-shot learning datasets.
Humans have an extraordinary ability to use previous knowledge to quickly learn new concepts from limited information. In contrast, although deep learning methods have achieved great success in many fields, such as image classification, natural language processing, and speech modeling [1, 2, 3, 4, 5, 6], these approaches tend to break down in low-data regimes due to the lack of sufficient labeled data for training. With the aim of learning new knowledge or concepts from a limited number of examples, few-shot learning approaches have been proposed to address this problem [7, 8, 9, 10, 11, 12].
Meta-learning, as shown in Fig. 1, is one of the major recent approaches to few-shot learning. It contains a base-learner and a meta-learner, which adapts the base-learner to new tasks with few samples. The goal of meta-learning is to accumulate “experience” to learn a prior over tasks and generalize to various new tasks with very little training data. However, a key challenge is representation deficiency: using few samples to effectively calculate gradients in a high-dimensional parameter space is intractable in practice [4]. This makes it difficult to represent common features and to train the meta-learner well enough to generalize to new tasks.
Recently, a number of studies [13, 14, 15, 16, 17] in meta-learning seek to represent features directly with less expressive feature extractors. These methods generally alleviate the tendency to over-fit in the few-shot setting and the difficulty of representing features [8, 18] caused by the representation deficiency, but at the expense of representation capacity and expressiveness. Furthermore, the inherent task ambiguity [16] in few-shot learning also leads to difficulties when scaling to high-dimensional data. Therefore, it is desirable to develop an approach that achieves both better representation and ambiguity awareness.
In this paper, inspired by the strong performance of variational inference in generating extra information [19, 20], we propose a novel end-to-end meta-learning approach with complemented representations network (MCRNet) for few-shot image classification. Fig. 2 gives an overview of the proposed method. Specifically, our approach embeds a latent space based on low-dimensional features learned from raw samples through encoding convolution layers. Latent codes z, reconstructed in this space, are then endowed with extra representation information to complement the representation deficiency and are decoded in a high-dimensional parameter space for better representation. Gradient-based optimization is performed with the new loss in both the latent space and the high-dimensional parameter space. In addition, the latent space is established via variational inference with stochastic initialization, enabling the method to model the inherent task ambiguity in few-shot learning [16]. Our results demonstrate that MCRNet achieves state-of-the-art performance in classification tasks on three few-shot learning datasets: CIFAR-FS [21], FC100 [22], and miniImageNet [23, 14].
The main contributions of our work are threefold:
We propose an end-to-end framework and embed a latent space to endow the reconstructed latent codes with more information, complementing the representation deficiency in a high-dimensional parameter space.
The probabilistic latent space with stochastic initialization collaborates well with different base-learners and can be extended to other architectures with high-dimensional feature extractors in few-shot learning.
We optimize the framework with a new loss function for the proposed latent space, which yields better generalization across tasks and achieves state-of-the-art performance in few-shot classification tasks.
Adaptation to new information from limited knowledge is an important aspect of human intelligence, which has led to the popularity of few-shot learning [13, 14, 15, 24, 25]. To tackle the few-sample problem, one of the major few-shot learning methods is meta-learning, which aims to accumulate “experience” and quickly adapt it to new tasks. Contemporary meta-learning methods can be roughly classified into three categories. 1) Metric-based methods [15, 14, 24], which learn a similarity space through training similar metrics over the same class to enhance effectiveness. 2) Memory-based methods [22, 26], which use a memory architecture to store key “experience” from seen samples and generalize to unseen tasks according to the stored knowledge. 3) Optimization-based methods [13, 27, 28], which search for a suitable meta-learner that is conducive to fast gradient-based adaptation to new tasks. In this process, the meta-learner and base-learner are continuously optimized in the “outer loop” and “inner loop”, respectively, and the optimization in our method is based on this concept.
MAML [13] is an important model in meta-learning and has recently been extended to many variants. By learning Gaussian posteriors over parameters to model task ambiguity, [16] and [17] propose probabilistic extensions to MAML. Unfortunately, these methods choose a less expressive architecture as a feature extractor, such as 4CONV [13], to avoid the representation deficiency in a high-dimensional parameter space. [29] proposes better base-learners and [18] incorporates a self-attention relation network. These two extract features directly via more powerful architectures, ResNet-12 [22] or ResNet-101, but neither addresses the deficiency mentioned above. LEO [4], using WRN-28-10 [30], takes the deficiency into account. However, the relation network incorporated in it may lead to a data-dependency issue, and the potential of the latent space may be neglected, since it does not perform optimization in the high-dimensional parameter space.
In contrast, we learn a probabilistic latent space over model parameters to generate latent codes with extra representation information and perform adaptation with new loss in it and the high-dimensional parameter space. This enables our method to complement the representation deficiency in few-shot learning intuitively.
Our main goal is embedding a latent space, established via variational inference, to complement the representation deficiency so as to learn better representations for meta-learning.
We evaluate the proposed approach on the N-way, K-shot problem for few-shot image classification, following the episodic formulation of [14] and Fig. 1, where N represents the number of classes in one task instance and K represents the number of training samples per class. Typically, K is 1 or 5 and N is 5.
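As a hedged illustration of the episodic formulation, the sketch below samples a single episode in which the support set holds a few labeled samples for each selected class and the query set holds other samples from the same classes. The function name `sample_episode`, the `data_by_class` layout, and the query size `q_per_class` are assumptions made for illustration and are not taken from the paper.

```python
import random

def sample_episode(data_by_class, n_way=5, k_shot=1, q_per_class=15, seed=None):
    """Sample one n_way-way, k_shot-shot episode.

    data_by_class: dict mapping class label -> list of samples (hypothetical layout).
    Returns (support, query), each a list of (sample, episode_label) pairs.
    """
    rng = random.Random(seed)
    classes = rng.sample(sorted(data_by_class), n_way)
    support, query = [], []
    for episode_label, cls in enumerate(classes):
        # Draw support and query samples together so that they never overlap.
        picked = rng.sample(data_by_class[cls], k_shot + q_per_class)
        support += [(s, episode_label) for s in picked[:k_shot]]
        query += [(s, episode_label) for s in picked[k_shot:]]
    return support, query

# Toy pool: 10 classes with 20 integer "images" each.
pool = {c: [c * 100 + i for i in range(20)] for c in range(10)}
support, query = sample_episode(pool, n_way=5, k_shot=1, q_per_class=3, seed=0)
```

Class labels are re-indexed from 0 within each episode, so a classifier fitted on the support set only ever sees the classes of the current task.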
The raw dataset is divided into three mutually disjoint meta-sets, a meta-training set, a meta-validation set, and a meta-testing set, so that the model’s generalizability can be assessed. The meta-validation set is used for model selection and the meta-testing set is used for the final evaluation. Each task/episode is constructed during meta-training. As shown in Fig. 1, data for one task are sampled from the meta-training set as follows: the training/support set consists of N classes selected from the meta-training set and K samples per class. The test/query set is composed of other, different samples belonging to the same N classes. The test/query set provides an estimate of generalization and is used to further optimize the meta-learning objective in each task. Notably, the test/query set of a task should not be confused with the meta-testing set.
Aiming to complement the representation deficiency, we apply a probabilistic scheme, based on [19], to establish a latent space and reconstruct latent codes with additional information in it.
To implement the parametric generative model, we first use several convolution layers as the encoder. Assuming that we acquire raw input x, the latent code z can be defined through a posterior distribution which is, however, usually hard to estimate. We therefore introduce an approximate distribution, conditioned on the input data point x and governed by variational parameters, to approximate this posterior distribution, as defined below:
(1)
where the mean and standard deviation of the approximate posterior are denoted as μ and σ, which are obtained through an encoding neural network:
(2)
where the parameter group contains the weights and biases of the neural network. We sample random variables from the posterior distribution to reconstruct the latent code:
(3)
subject to
where represents the production function of and denotes element-wise multiplication.
In addition, we assume that both the prior over the latent code and the approximate posterior follow a multivariate Gaussian:
(4)
To enforce that, we compute the variation loss as follows:
(5)
where the first term is the Kullback-Leibler divergence, the second is the reconstruction error, and μ_j and σ_j respectively denote the j-th mean and standard deviation of feature samples.
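For reference, the standard variational auto-encoder formulation that this description follows can be sketched as below. This is a hedged reconstruction, not the authors' exact equations. The symbols $\phi$ (variational parameters), $\epsilon$ (sampled noise), and $J$ (latent dimension) are assumptions. So are the standard-normal prior and the sign convention of the loss.

```latex
\begin{align*}
q_\phi(z \mid x) &= \mathcal{N}\big(z;\ \mu,\ \sigma^2 I\big), \\
z &= \mu + \sigma \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I), \\
p(z) &= \mathcal{N}(z;\ 0,\ I), \\
D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\|\,p(z)\big)
  &= -\frac{1}{2} \sum_{j=1}^{J} \big(1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2\big), \\
\mathcal{L}_{\mathrm{var}}
  &= D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\|\,p(z)\big)
   - \mathbb{E}_{q_\phi(z \mid x)}\big[\log p(x \mid z)\big].
\end{align*}
```

Under these assumptions the Kullback-Leibler term has a closed form in the per-dimension means and standard deviations, which is why only $\mu_j$ and $\sigma_j$ appear in it.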
In our method, as shown in Fig. 3, latent codes z are reconstructed through combining the original features and additional representation information from the encoding neural network with variational inference. Meanwhile, extra information enables expressing the ambiguity in few-shot learning in a probabilistic manner [16].
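The sampling of latent codes and the Kullback-Leibler term described above can be illustrated with a minimal sketch. It assumes a diagonal Gaussian posterior parameterized by a mean and a log-variance, and a standard-normal prior; the function names and the log-variance parameterization are illustrative choices, not the authors' implementation.

```python
import math
import random

def reparameterize(mu, log_var, rng):
    """Return z = mu + sigma * eps, with eps drawn from a standard normal."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL divergence from N(mu, diag(sigma^2)) to N(0, I)."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, log_var))

rng = random.Random(0)
latent_dim = 64                      # largest dimension reported in the experiments
mu = [0.0] * latent_dim
log_var = [0.0] * latent_dim         # log(sigma^2) = 0, i.e. sigma = 1
z = reparameterize(mu, log_var, rng)
kl_zero = kl_to_standard_normal(mu, log_var)
kl_shifted = kl_to_standard_normal([1.0] * latent_dim, log_var)
```

Because the noise is drawn separately from the mean and standard deviation, gradients can flow through both, and the stochastic draw is what lets the latent code express ambiguity across repeated samples.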
For each task, the adaptation model produces a novel classifier, defined as the base-learner in meta-learning and optimized constantly. This process is performed on the training/support set. The purpose of the base-learner is to estimate the parameters of the predictor for classification tasks. We obtain these parameters by minimizing the empirical loss:
(6) |
where is a loss function defined as negative log-likelihood of labels in our method and is designed as a regularization term. is parameter of embedding module , as shown in Fig. 2.
As an integral part of meta-learning, the base-learner plays an important role, and we choose the base-learners suggested in [29], considering that the objective of the base-learner in our method is also convex. Two types of base-learners are used in our method, Support Vector Machine (SVM) and Ridge Regression (RR), which are both based on multi-class linear classifiers and perform well as reported in [29]. Specifically, a -class linear base-learner could be expressed as = . As a result, (6) is transformed into the Crammer and Singer formulation [31]:
(7)
subject to
where , is designed for regularization and is the Kronecker delta function.
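For reference, the multi-class SVM objective of Crammer and Singer [31], in the form commonly used for linear few-shot base-learners, can be sketched as follows. The notation is an assumption rather than the authors' own: $w_k$ is the weight vector of class $k$, $f(x_n)$ the embedded feature of support sample $x_n$ with label $y_n$, $\xi_n$ a slack variable, $C$ the regularization constant, and $\delta$ the Kronecker delta.

```latex
\begin{align*}
\min_{\{w_k\}} \ \min_{\{\xi_n\}} \quad
  & \frac{1}{2} \sum_{k} \lVert w_k \rVert_2^2 + C \sum_{n} \xi_n \\
\text{subject to} \quad
  & w_{y_n} \cdot f(x_n) - w_k \cdot f(x_n) \ \ge\ 1 - \delta_{y_n, k} - \xi_n,
  \qquad \forall\, n, k.
\end{align*}
```

The constraint asks the score of the true class to exceed every other class score by a margin of one, up to the slack $\xi_n$, and the objective is convex in the weights.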
Given the decoder parameters and base-learner, we then compute the “inner loop” loss in each episode as follows:
(8) |
where is the i-th sample from . is an optimizable scale parameter which has shown good performance in few-shot learning [21, 22] when SVM and RR are used as base-learners. Steps 6 to 12 in Algorithm 1 show the “inner loop”.
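To make the role of the scale parameter concrete, here is a minimal sketch of a scaled negative log-likelihood for a linear base-learner. It is an illustration under stated assumptions, not the authors' implementation: the name `scaled_nll`, the symbol `gamma` for the scale, and the use of per-class weight vectors applied to embedded features are all hypothetical choices.

```python
import math

def scaled_nll(weights, features, labels, gamma=1.0):
    """Mean negative log-likelihood of labels under softmax(gamma * W f(x)).

    weights:  list of per-class weight vectors (hypothetical base-learner output).
    features: list of embedded samples, each a list of floats.
    labels:   list of integer class indices.
    """
    total = 0.0
    for f, y in zip(features, labels):
        logits = [gamma * sum(w_j * f_j for w_j, f_j in zip(w, f)) for w in weights]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[y]
    return total / len(labels)

# Two classes, two-dimensional features, both samples correctly separated.
W = [[1.0, 0.0], [0.0, 1.0]]
X = [[2.0, 0.0], [0.0, 2.0]]
y = [0, 1]
loss_small_scale = scaled_nll(W, X, y, gamma=1.0)
loss_large_scale = scaled_nll(W, X, y, gamma=5.0)
```

When the samples are already classified correctly, a larger scale sharpens the softmax and lowers the loss, which is one reason to treat the scale as an optimizable quantity rather than a fixed constant.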
After the “inner loop”, we obtain a preliminary optimized base-learner and then conduct the second step “outer loop” on . In this part, encoding and decoding parameter , will be updated to improve generalizability to unseen samples through gradient-based optimization. The “outer loop” is performed by minimizing the new loss consisting of two parts:
(9) |
where the first term is the counterpart of (8) on and the second term uses the variation loss defined in (5), with an optimizable weight, to regularize the latent space. Steps 4 to 18 in Algorithm 1 show the “outer loop”.
Once finishing the meta-training, we evaluate the generalizability of model to unseen data tuple , which is sampled from . Hence, we compute the loss for evaluation during meta-testing as follows:
(10) |
We use the meta-testing set for evaluation, and the parameters of the model will not be updated in each episode .
method | backbone | CIFAR-FS 1-shot | CIFAR-FS 5-shot | FC100 1-shot | FC100 5-shot |
---|---|---|---|---|---|
Relation Networks [24] | 4CONV | 55.0 ± 1.0 | 69.3 ± 0.8 | - | - |
Prototypical Networks [15] | 4CONV | 55.5 ± 0.7 | 72.0 ± 0.6 | 35.3 ± 0.6 | 48.6 ± 0.6 |
MAML [13] | 4CONV | 58.9 ± 1.9 | 71.5 ± 1.0 | - | - |
R2D2 [21] | 4CONV | 65.3 ± 0.2 | 79.4 ± 0.1 | - | - |
Fine-tuning [32] | ResNet-12 | 64.66 ± 0.73 | 82.13 ± 0.50 | 37.52 ± 0.53 | 55.39 ± 0.57 |
TADAM [22] | ResNet-12 | - | - | 40.1 ± 0.4 | 56.1 ± 0.4 |
MTL [8] | ResNet-12 | - | - | 43.6 ± 1.8 | 55.4 ± 0.9 |
Baseline-RR [29] | ResNet-12 | 72.6 ± 0.7 | 84.3 ± 0.5 | 40.5 ± 0.6 | 55.3 ± 0.6 |
Baseline-SVM [29] | ResNet-12 | 72.0 ± 0.7 | 84.2 ± 0.5 | 41.1 ± 0.6 | 55.5 ± 0.6 |
MCRNet-RR (ours) | ResNet-12 | 73.8 ± 0.7 | 85.2 ± 0.5 | 40.7 ± 0.6 | 56.6 ± 0.6 |
MCRNet-SVM (ours) | ResNet-12 | 74.7 ± 0.7 | 86.8 ± 0.5 | 41.0 ± 0.6 | 57.8 ± 0.6 |
indicates that the method is not end-to-end.
Comparisons of average classification accuracy (%) with 95% confidence intervals on the CIFAR-FS and FC100. “SVM” or “RR” means using SVM or Ridge Regression as base-learner.
method | backbone | 1-shot | 5-shot |
---|---|---|---|
Meta-Learning LSTM [23] | 4CONV | 43.44 ± 0.77 | 60.60 ± 0.71 |
Matching networks [14] | 4CONV | 43.56 ± 0.84 | 55.31 ± 0.73 |
MAML [13] | 4CONV | 48.70 ± 1.84 | 63.11 ± 0.92 |
Prototypical Networks [15] | 4CONV | 49.42 ± 0.78 | 68.20 ± 0.66 |
Relation Networks [24] | 4CONV | 50.44 ± 0.82 | 65.32 ± 0.70 |
R2D2 [21] | 4CONV | 51.2 ± 0.6 | 68.8 ± 0.1 |
SRAN [18] | ResNet-101 | 51.62 ± 0.31 | 66.16 ± 0.51 |
DN4 [33] | ResNet-12 | 54.37 ± 0.36 | 74.44 ± 0.29 |
SNAIL [26] | ResNet-12 | 55.71 ± 0.99 | 68.88 ± 0.92 |
Fine-tuning [32] | ResNet-12 | 56.67 ± 0.62 | 74.80 ± 0.51 |
TADAM [22] | ResNet-12 | 58.50 ± 0.30 | 76.70 ± 0.30 |
CAML [34] | ResNet-12 | 59.23 ± 0.99 | 72.35 ± 0.71 |
TPN [35] | ResNet-12 | 59.46 | 75.65 |
wDAE-GNN [36] | WRN-28-10 | 61.07 ± 0.15 | 76.75 ± 0.11 |
MTL [8] | ResNet-12 | 61.2 ± 1.8 | 75.5 ± 0.8 |
LEO [4] | WRN-28-10 | 61.76 ± 0.08 | 77.59 ± 0.12 |
LEO [4] | ResNet-12 | 58.67 ± 0.07 | 73.45 ± 0.12 |
Baseline-RR [29] | ResNet-12 | 60.02 ± 0.64 | 76.51 ± 0.49 |
Baseline-SVM [29] | ResNet-12 | 60.73 ± 0.65 | 76.16 ± 0.49 |
MCRNet-RR (ours) | ResNet-12 | 61.32 ± 0.64 | 78.16 ± 0.49 |
MCRNet-SVM (ours) | ResNet-12 | 62.53 ± 0.64 | 80.34 ± 0.47 |
indicates those methods are not end-to-end. indicates those methods that are reproduced by ourselves for comparison of convergence.
We evaluate the proposed MCRNet for few-shot classification tasks on the unseen meta-testing set and compare it with the state-of-the-art methods.
CIFAR-FS is a recent standard benchmark for few-shot learning tasks, consisting of 100 classes from CIFAR-100 [37]. These classes are randomly divided into 64, 16, and 20 classes for meta-training, meta-validation, and meta-testing, with 600 images of size 32 × 32 in each class.
FC100 is another dataset derived from CIFAR-100 and is similar to CIFAR-FS. Unlike CIFAR-FS, its 100 classes are grouped into 20 superclasses and split into 60 classes from 12 superclasses for meta-training, 20 classes from 4 superclasses for meta-validation, and 20 classes from 4 superclasses for meta-testing. This dataset is intended to reduce the semantic similarity between classes.
MiniImageNet is a common dataset for few-shot learning tasks, with 100 classes randomly sampled from ILSVRC-2012 [38] and divided into meta-training, meta-validation, and meta-testing sets with 64, 16, and 20 classes, respectively. Each class contains 600 images of size 84 × 84, and we adopt the commonly used class split of [23].
(a) Comparison of convergence on miniImageNet. Accuracies are obtained on meta-validation set after each epoch. (b) Some examples of the representation output from baseline and MCRNet.
In a nutshell, our method consists of a meta-learner (including an encoder and a decoder) and a base-learner. The encoder is composed of several convolution layers and a sampling architecture, which is used to produce latent codes. The dimension of the latent space is determined by the encoder. Currently, there are two popular network models for the decoder to learn high-dimensional features in meta-learning, 4CONV and ResNet-12. Both consist of 4 blocks with several convolutions, Batch Normalization, and pooling layers, but differ in the number of filters in each block. We choose ResNet-12 as the decoder in our method, considering the poor performance of 4CONV for few-shot learning [8]. For the base-learner, we use two classic linear methods, multi-class SVM and RR.
In our experiments, the hyperparameters generally remain consistent with the baseline [29] for fair comparison. We apply stochastic gradient descent-based optimization and set the Nesterov momentum and weight decay to 0.9 and 0.0005, respectively. In total, we meta-train the model for 60 epochs with 1000 episodes in each epoch. The learning rate is set to 0.1, 0.006, 0.0012, and 0.00024 at epochs 1, 20, 40, and 50, respectively.
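The step schedule stated above can be written as a small lookup. This is a sketch under one assumption: each listed rate is taken to apply from its stated epoch (counted from 1) until the next boundary. The function name `learning_rate` is hypothetical.

```python
# (first_epoch, learning_rate) pairs as stated in the text; epochs are 1-indexed.
SCHEDULE = [(1, 0.1), (20, 0.006), (40, 0.0012), (50, 0.00024)]

def learning_rate(epoch):
    """Return the learning rate in effect at the given epoch (1..60)."""
    if epoch < 1:
        raise ValueError("epochs are 1-indexed")
    rate = SCHEDULE[0][1]
    for first_epoch, lr in SCHEDULE:
        # Keep the rate of the last boundary that has been reached.
        if epoch >= first_epoch:
            rate = lr
    return rate

rates = [learning_rate(e) for e in (1, 19, 20, 39, 40, 49, 50, 60)]
```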
Our experimental results on the datasets are averaged over 1000 test episodes and summarized in TABLE I and TABLE II. In general, compared to the baseline and current state-of-the-art methods, our model improves accuracy on both 5-way 1-shot and 5-shot classification tasks.
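Since the tables report mean accuracy with 95% confidence intervals over test episodes, the following sketch shows one common way to compute such an interval, using the normal approximation (mean ± 1.96 standard errors). Whether the authors used exactly this estimator is an assumption, and the per-episode accuracies below are synthetic.

```python
import math
import random
import statistics

def mean_with_ci95(accuracies):
    """Return (mean, half_width) of a normal-approximation 95% confidence interval."""
    n = len(accuracies)
    mean = statistics.fmean(accuracies)
    std_err = statistics.stdev(accuracies) / math.sqrt(n)
    return mean, 1.96 * std_err

# Synthetic per-episode accuracies (%), for illustration only.
rng = random.Random(0)
episode_acc = [rng.gauss(60.0, 10.0) for _ in range(1000)]
mean, half_width = mean_with_ci95(episode_acc)
```

The half-width shrinks with the square root of the number of episodes, so averaging over 1000 episodes gives intervals well under one percentage point for typical per-episode spreads.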
As detailed in TABLE I, in the 1-shot and 5-shot tests, our method achieves improvements of 10.04 and 4.67 percentage points over Fine-tuning [32] on the CIFAR-FS dataset. On FC100, our method improves on the baseline by 2.3 percentage points in 5-shot classification. TABLE II reports our experiments on the miniImageNet dataset, where our method beats the baseline accuracy by 1.80 and 4.18 percentage points in the 1-shot and 5-shot tests, respectively. On miniImageNet, the proposed MCRNet also achieves improvements of 4.84 and 2.75 percentage points over the competitive methods MTL [8] and LEO [4], respectively, in 5-shot classification.
Our method provides a solution to break the representation limit in a high-dimensional parameter space by complementing the representation deficiency, and further realizes the potential of ResNet-12 in representation learning. It outperforms the other methods that use ResNet-12, especially on CIFAR-FS and miniImageNet. Moreover, by collaborating well with different base-learners, MCRNet outperforms existing models on different datasets, indicating the good applicability of our method.
Although the methods in [18, 4, 36, 8] use pre-trained or deeper feature extractors and TADAM [22] co-trains the feature extractor on 5- and 64-way classification tasks to tackle the representation limit in few-shot learning, our method still generally achieves an improvement. In addition, our model is meta-trained end-to-end, allowing us to clearly observe the impact of the latent space on meta-learning. These results suggest that it is essential to complement the representation deficiency when using an expressive architecture in few-shot learning.
We reproduce the baseline [29] and the state-of-the-art LEO [4] with the provided code on miniImageNet using ResNet-12 and compare them with our method in terms of convergence rate. As shown in Fig. 4(a), the number of tasks our method requires to reach convergence is almost the same as for the baseline, but our method achieves higher accuracy. LEO needs a pre-trained raw embedding network. It adopts MAML as its training method, and a relation network is embedded in it. Therefore, it is hard for LEO to train the whole model well, and more iterations are needed to reach convergence. As shown in TABLE II, compared to LEO with ResNet-12, the accuracies of our method improve by 3.86 and 6.89 percentage points in 5-way 1-shot and 5-shot classification on miniImageNet, respectively.
Fig. 4(b) shows some examples of the representation output from the baseline and our model. As can be seen in the red box, our model is able to learn more details, thanks to the complemented representations. Compared to the baseline, the representation outputs generated by our model tend to look better, with higher brightness and sharper texture, which benefits generalization.
Regarding the influence of the latent space dimension on the model’s performance, as shown in TABLE III, we found that for a particular dataset and base-learner, there is a suitable latent space dimension that improves generalizability. In our experiments, this value is mostly 64, which suggests the potential of our method at higher latent dimensions. Notably, with dimension 64, our method achieves an improvement of about 4 percentage points in 5-way 5-shot classification accuracy on miniImageNet with the SVM base-learner. However, considering the interface with ResNet-12 and fair comparison, the dimension in our experiments is capped at 64.
Latent Dimension | shot | CIFAR-FS RR | CIFAR-FS SVM | FC100 RR | FC100 SVM | miniImageNet RR | miniImageNet SVM |
---|---|---|---|---|---|---|---|
without MCRNet | 1 | 72.6 | 72.0 | 40.5 | 41.1 | 60.0 | 60.7 |
8 | 1 | 69.4 | 70.1 | 39.0 | 39.8 | 57.2 | 58.7 |
16 | 1 | 70.3 | 71.2 | 39.2 | 40.4 | 58.4 | 60.1 |
32 | 1 | 72.9 | 72.5 | 40.7 | 40.7 | 60.3 | 61.5 |
64 | 1 | 73.8 | 74.7 | 40.4 | 41.0 | 61.3 | 62.5 |
without MCRNet | 5 | 84.3 | 84.2 | 55.3 | 55.5 | 76.5 | 76.2 |
8 | 5 | 82.6 | 83.8 | 53.5 | 54.0 | 75.8 | 75.4 |
16 | 5 | 83.8 | 84.3 | 54.0 | 55.4 | 76.4 | 77.5 |
32 | 5 | 84.5 | 85.5 | 56.6 | 56.6 | 76.9 | 79.0 |
64 | 5 | 85.2 | 86.8 | 55.4 | 57.8 | 78.2 | 80.3 |
In this paper, we proposed MCRNet for few-shot learning, which achieved state-of-the-art performance on the challenging 5-way 1-shot and 5-shot CIFAR and ImageNet classification problems. MCRNet used a probabilistic framework to learn a latent space that complements the representation deficiency with extra representation information and breaks the representation limit in a high-dimensional parameter space, resulting in better generalization across tasks. The experimental results also demonstrated the good performance of our method when applied to different base-learners.
This work was supported in part by Fundamental Research Funds for the Central Universities under Grant 191010001, and Hubei Key Laboratory of Transportation Internet of Things under Grant 2018IOT003, Department of Science and Technology of Hubei Provincial People’s Government under Grant 2017CFA012, and Ministry of Science and Technology of Taiwan under Grants MOST 108-2634-F-007-009.
Q. Sun, Y. Liu, T. S. Chua, and B. Schiele, “Meta-transfer learning for few-shot learning,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 403–412.
E. Schonfeld, S. Ebrahimi, S. Sinha, T. Darrell, and Z. Akata, “Generalized zero- and few-shot learning via aligned variational autoencoders,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 8247–8255.
AAAI Conference on Artificial Intelligence (AAAI), 2020.
International Conference on Machine Learning (ICML), 2017, pp. 1126–1135.
T. Silver, K. R. Allen, A. K. Lew, L. P. Kaelbling, and J. Tenenbaum, “Few-shot bayesian imitation learning with logic over programs,” in AAAI Conference on Artificial Intelligence (AAAI), 2020.
S. Gidaris and N. Komodakis, “Generating classification weights with GNN denoising autoencoders for few-shot learning,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 21–30.